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Abstract 

Background: The first generation of genome sequence assemblies and annotations have had a significant impact 
upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, 
the study of populations within and across species, and have informed the biology of humans. As only a few 
Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for 
improvement of most genome assemblies. The honey bee {Apis mellifera) genome, published in 2006, was noted 
for its bimodal GC content distribution that affected the quality of the assembly in some regions and for fewer 
genes in the initial gene set (OGSvl.O) compared to what would be expected based on other sequenced insect 
genomes. 

Results: Here, we report an improved honey bee genome assembly (Amel_4.5) with a new gene annotation set 
(OGSv3.2), and show that the honey bee genome contains a number of genes similar to that of other insect 
genomes, contrary to what was suggested in OGSvl.O. The new genome assembly is more contiguous and 
complete and the new gene set includes -5000 more protein-coding genes, 50% more than previously reported. 
About 1/6 of the additional genes were due to improvements to the assembly, and the remaining were inferred 
based on new RNAseq and protein data. 

Conclusions: Lessons learned from this genome upgrade have important implications for future genome 
sequencing projects. Furthermore, the improvements significantly enhance genomic resources for the honey bee, 
a key model for social behavior and essential to global ecology through pollination. 

Keywords: Apis meilifera, GC content. Gene annotation. Gene prediction. Genome assembly, Genome improvement. 
Genome sequencing. Repetitive DNA, Transcriptome 
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Background 

Honey bee, Apis mellifera, provides a key model for un- 
derstanding the diverse and widespread group of social 
insects. Thanks, in part, to resources generated during 
the initial honey bee genome sequencing effort [1], 
honey bees are now widely used for elucidating the mo- 
lecular basis of behavior [2], development [3], disease 
transmission [4], epigenomics [5] and gene regulation 
[6]. Honey bees are also a vital pollinator of many of the 
world's crops [7,8], and genomic tools are being used to 
address recent serious and unexplained declines in some 
honeybee populations [9,10]. In light of the current 
interest in honey bee genetics, improved genomic tools 
for this species are required. 

Because fewer than 11,000 genes were predicted, one 
of the major questions coming out of the initial honey 
bee genome sequencing effort was, "Where are the miss- 
ing genes?" Were there issues with the genome sequence 
that meant that the genes were filled with gaps and 
therefore not annotated? Or were the genes located in 
islands of repetitive sequence and therefore not identi- 
fied? Were the gene prediction methods poorly adapted 
to a genome with a bimodal distribution in GC content? 
Or was the tuning of gene prediction algorithms to an 
average GC content failing to represent genes in regions 
of more extreme GC content? Was there insufficient 
mRNA evidence? Were the genes different enough from 
known genes that the prediction tools failed to find the 
genes due to lack of protein evidence? Or might the 
honey bee have thousands fewer genes than the few 
insects with sequenced genomes at the time? 

All early genome sequencing projects used Sanger data 
and either a BAC-based hierarchical-sequencing model 
[11] or a whole-genome-shotgun model [12]. In either 
case, a completed draft genome has contiguous sequence 
pieces (contigs) spaced by paired clone end sequences, 
forming scaffolds. Missing sequences between consecu- 
tive contigs are represented by segments of ambiguous 
bases, denoted as Ns, the lengths of estimated gap sizes. 
Few Metazoan genomes have been improved to finished 
quality by filling in the missing sequence [13-16]. The 
genome scaffolds, with islands of sequence information 
and lack of information (in a draft genome), are anno- 
tated with gene features using gene prediction methods. 
When gene signals such as splice sites, start codons and 
termination codons are missing from the assembly, a 
computational gene prediction tool may miss an exon or 
even an entire gene. Thus, gaps within genie regions of 
an assembly can hinder annotation and lead to an in- 
complete gene list. 

The first A mellifera genome sequencing project [1] re- 
vealed genome characteristics with potential missing as- 
sembly information that could impact the gene list. The 
genome assembly had the lowest mean GC content 



(percent of G + C nucleotides) and the most heterogeneous 
GC content of any sequenced metazoan genome at that 
time, with GC content ranging from 10% to 70% across the 
published genome assembly (Amel_4-0) [1]. Analyses of ini- 
tial assemblies showed that regions with low GC content 
were under-represented in libraries, so additional shotgun 
libraries were generated after fractionating DNA with CsCl- 
bisbenzimide density gradient centrifugation, and appro- 
ximately 800,000 reads of <30% GC content were added to 
the data that lead to Amel_4.0 [1]. 

In addition to potentially missing assembly informa- 
tion, other factors including limited EST data and the 
large evolutionary distance between honey bee and other 
insects with sequenced genomes were thought to have 
contributed to difficulty in gene prediction. The 78,001 
A. mellifera ESTs that were available at the time pro- 
vided only 9,408 unique consensus sequence alignments 
to the genome [17]. The closest organisms with available 
protein sets were the Dipterans, Drosophila melanogaster 
and Anopheles gambiae, with an estimated divergence 
time from honey bee (Hymenopteran) of 300 million 
years. 

Honey bee researchers suspected that the first Official 
Gene Set (OGSvl.O), generated by creating a consensus 
gene set with GLEAN [1,17], was incomplete, as it com- 
prised only 10,157 genes, far fewer than expected based 
on the estimate of ~13,600 genes in D. melanogaster at 
the time. Recently sequenced Hymenoptera genomes 
(parasitoid wasp and seven ant species), predicted to 
contain -17,000-18,500 protein-coding genes [18-24], 
further support the suspicion that a large number of A. 
mellifera genes had not yet been detected. Whole ge- 
nome tiling arrays provided experimental evidence for 
missing genes, with the detection of signals "intergenic" 
to OGSvl.O genes [1]. Finally, evaluation of the genome- 
wide distribution of OGSvl.O genes revealed another po- 
tential factor in computational gene prediction, which 
relied on algorithms optimized for gene and genome 
characteristics known at that time. Unlike any previously 
sequenced metazoan genome, OGSvl.O genes occurred 
in gene regions that were more GC-poor (29% GC) than 
the mean genome GC content (33%), and were located 
in regions with GC content as low as 11% [1]. 

Genome sequencing and assembly methods have chan- 
ged with the advent of next-generation sequencing tech- 
nologies. Newer technologies typically produce shorter 
reads at greatly reduced cost and allow the completion of 
projects with much deeper sequence coverage using many 
more reads than a Sanger-based project. While contiguous 
sequence may suffer from these shorter reads that cannot 
span repetitive sequences, scaffolding can benefit from the 
increased mate-pair information. Furthermore, different se- 
quencing technologies have different biases related to nu- 
cleotide composition, so combining technologies provides 
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the potential for one technology to fill in gaps produced by 
another. 

A bigger impact of the reduced sequencing cost of 
second-generation sequencing methods is the ability to 
generate much more transcript sequence than ever be- 
fore. These transcript data are very useful as evidence 
supporting gene model prediction. As will be discussed 
below, the alignment of new transcriptome data pro- 
vided evidence for most of the previously un-annotated 
genes in the honey bee genome. 

This paper presents a de novo annotation of the A. 
melUfera genome based on an upgraded genome assem- 
bly and new transcript data. We generated additional 
genome sequence data using ABI SOLID and Roche 454 
paired-end sequencing technologies to superscaffold and 
fill gaps in the Amel_4.0 assembly using the Atlas-link 
software, creating the improved assembly (Amel_4.5). 
We also deep-sequenced the transcriptome of seven A. 
mellifem tissues and sequenced the genomes of two 
closely related species, the dwarf honey bee {A. florea) 
and buff-tailed bumble bee {Bombus terrestris) (both of 
which will be published elsewhere). 

To avoid perpetuating errors in OGSvl.O and avert dif- 
ficulties in tracking genes between assemblies (e.g. re- 
ported in [25]), we chose not to track and update the 
previous annotation on the Amel_4.5 assembly. Instead 
we decided to annotate the Amel_4.5 assembly de novo 
and generate a new official gene set (OGSv3.2) and a 
new list of repetitive elements based on the most current 
evidence and methodologies. To learn which factors 
were most important in identifying previously unknown 
genes, and to inform potential strategies for future ge- 
nome projects, we compared both gene sets and charac- 
terized genes that were missing from OGSvl.O. 

Results 

Improvements to the genome assembly 

Additional sequence coverage generated using the ABI 
SOLID [26] technology (~20x) and Roche 454 [27] tech- 
nology (~4x) from small insert fragments (~2 k) was in- 
corporated into Amel_4-5. The sequence data are 
summarized in Table 1 and Table SI in Additional file 1 
and are available from the NCBI Sequence Read Archive 
(SRA). We compared assembly statistics for Amel_4.5 



Table 1 Additional sequence data for the improved 
honey bee genome 



Mate-pair 


Roche 454 


Roche 454 


SOLiD 


distance 


fragment 


2.75 kb 




Number of reads 


2.9 M 


5.24 M 


90 M 


Read length 


290 bp 


56 bp 


50 bp 


Pairs 


No 


Yes 


Yes 


Sequence coverage 


3.6x 


1.3x 


20x 



with the previous assembly, Amel_4 0, which is de- 
scribed in [1] and is the assembly used by the research 
community since 2006. The genome assembly improve- 
ments increased the contig N50 from 40 kb to 46 kb 
and the scaffold N50 from 359 kb to 997 kb, with an 
additional 5.5% of the sequence anchored to linkage 
groups (Table 2). 

The completeness (extent of coverage of the genome) of 
the new Amel_4.5 and the earlier, Amel_4-0, assemblies 
were measured by BLASTN comparison to available de 
novo assembled 454 contigs from seven libraries. Amel_4-5 
showed slightiy more complete coverage than Amel_4-0 
(88.7% vs. 88.2%; Table 3). 

As hoped, the new sequence data had a large impact on 
the regions of the genome that were of low starting quality. 
While the new sequence did not specifically target the GC- 
poor regions of the genome that were underrepresented in 
the initial Sanger libraries, the sequence reads were more 
evenly distributed across the regions of different GC con- 
tents (Figure 1). A segmentation analysis of Amel_4-5 to 
identify GC compositional domains that are homogenous 
in GC content within domains, but different in GC content 
across domain boundaries, showed that the genomic pro- 
portion of compositional domains low in GC content in- 
creased in Amel_4.5 (Figure 2). As a consequence, many 
newly discovered genes were annotated in GC poor regions 
(Figure 2). 

Generating and assessing a New official gene Set 
Testing combined gene set approaches 

We tested two approaches, MAKER2 [28] and GLEAN 
[17], for combining outputs from multiple gene prediction 
tools to create an Official Gene Set (OGS). Each approach 
has advantages and disadvantages. An advantage of 
MAKER2 is that given sufficient transcript data, it can pre- 
dict multiple alternative splice forms per locus, while 
GLEAN computes only one gene model per locus. 
MAKER2 provides both a conservative biological sequence 
(transcript and homolog) alignment-based set of gene 
models that best agree with transcript and protein homolog 



Table 2 Assembly statistics for the improved honey 
bee genome 



Assembly 




Number 


N50 

(kb)= 


Bases + Gaps 
(kb) 


Bases 
(kb) 


Amel_4.5 


Anchored 


340 


1,209 


203,000 


200,000 




Scaffolds 


5,644 


997 


250,271 


229,734 




Contigs 


16,501 


46 


229,734 


229,734 


Amel_4.0 


Anchored 


626 


621/135 


217,195 


183,323 




Scaffolds 


10,742 


359 


315,719 


231,029 




Contigs 


18,944 


40 


231,029 


231,029 



^For Amel_4.0 Anchored scaffolds, N50 was calculated separately for 320 
oriented and 306 non-oriented scaffolds. 
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Table 3 Assembly comparison to 454 transcriptome data 



Tissue 


Total 
assembled 
transcripts 


Amel 


_4.S 


Ame 


_4.0 


Number 


Percent 


Number 


Percent 


A KH n m P n 


14514 


1 3 980 


95 6% 


1 3 987 


95 7% 


Brain + ovary 


27,412 


25,341 


95.0% 


26,342 


95.0% 


Embryo 


19,616 


18,555 


94.6% 


18,565 


94.6% 


Larvae 


1 8,050 


9,051 


50.1% 


9,041 


50.0% 


Mixed antennae 


14,891 


1 3,858 


93.1% 


13,865 


93.1% 


Ovary 


28,451 


27,500 


96.6% 


26,929 


94.6% 


Testes 


10,557 


9,234 


87.4%) 


9,060 


85.8% 


Total 


133,591 


1 1 8,549 


88.7% 


1 1 7,789 


88.2% 



Note: BLAT alignments of the assembled transcripts to the genome assemblies 
using default parameters and counting matches of any length at 95% identity. 



alignments, and a set of ab initio gene predictions that do 
not overlap the biological sequence alignment-based set. 

The final set of MAKER2 biological sequence alignment- 
based gene set contained 12,268 genes, encoding 12,575 
transcripts, with an additional set of 31,047 ab initio gene 
predictions that did not have support from transcript or 
protein alignments. Without filtering the ab initio set for 
likely false positives, the choices of final gene sets generated 
with MAKER2 would have been 1) a conservative set of 
only 12,268 genes, still much lower than gene numbers 
found in other sequenced insect genomes or 2) a set of 
43,315 genes that included ab initio gene models. Other 
hymenoptera genome projects using MAKER have selected 



ab initio gene models with matches to InterPro domains 
[29] to include in the final predicted gene set [20-22]. This 
approach would allow us to add 1,215 MAKER2 ab initio 
genes to the final predicted gene set, for a total of 13,303 
genes. However, since only 8,602 of the 12,268 MAKER2 
biological sequence alignment-based genes had InterPro 
matches, we were concerned that the InterPro approach to 
boost ab initio genes to the final gene set would miss some 
rapidly diverging genes. 

An important difference between GLEAN and MAKER2 
is the way ab initio predictions without biological sequence 
evidence are handled. GLEAN collects evidence for genes {ab 
initio gene predictions and biological sequence alignment- 
based), generates maximum likelihood estimates of accuracy 
and error rates for these signals for each gene predic- 
tion set, constructs consensus gene models that maximize 
the scores for the signal sites in each gene model, and la- 
bels each model with a probabilistic confidence score 
reflecting the underlying support for that gene model [17]. 
In addition to identifying consensus models supported by 
transcript and protein alignments, GLEAN identifies ab 
initio predictions with high computed probabilities, filter- 
ing out likely false positive ab initio predictions. This ap- 
proach would not require filtering ab initio predictions 
using an InterPro search, and would potentially include 
high confidence ab initio predictions that are true, rapidly 
diverging genes without significant InterPro domain 
matches. 

Since GLEAN weighs different sources of gene predic- 
tion evidence based on agreement among sources, 



454 reads mapped region AT percentage distribution 




Figure 1 Distribution of mapped 454 reads witli respect to AT content. The genomic reads were mapped to the Amel_4.5 assembly 
(scaffolds and contigs) using BLAT. With relatively stringent filtering (at least 80% of total length matched and gap size < 30%)), 242,284 reads (93% 
of all reads) were aligned to the assembly. Most reads (236,090, 93%) aligned to fewer than 10 locations, and had unique alignments (210,625, 
87%)). The AT content for each alignment (adding 10% extension on either end) was calculated for reads with < 10 match locations. 
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20 30 40 50 

%GC 

Figure 2 GC content of genie regions and overall genome 
assemblies. For each gene, the GC content (percent G + C 
nucleotides) of genomic regions containing the gene was 
determined as described in methods. The cumulative distributions of 
GC content for overall genome assemblies (thin red line for Amel 4.5 
and thin black line for Amel_2.0) show that the Amel_4.5 assembly 
has a higher fraction of low GC content regions than does the 
Amel_2.0 assembly (note the thin red line is to the left of the thin 
black line below about 28% GC). The cumulative distributions of GC 
content for the regions containing genes (thick red line for all 
OGSv3.2, thick green line for Previously known genes, thick blue line 
for Type I New genes and thick pink line for Type II new genes) 
show that regions containing genes are lower in GC content than 
the overall genome. This trend applies for the complete set of 
OGSv3.2 genes, as well as the three subsets. The distribution for 
Type I New genes lies to the left of the other distributions, showing 
that Type I New genes are located in lower GC content regions than 
the other gene subsets. The distribution for Type II new genes is to 
the right of the distributions for the other gene subsets, showing 
that the Type II New genes are located in higher GC 
content regions. 



without a priori information on the quahty of individual 
input gene prediction sets, the choice of input datasets 
impacts the results. Evaluating GLEAN runs with differ- 
ent combinations of input sets allowed us to determine 
the optimum selection of datasets. We ran GLEAN 32 
times with different combinations of input sets, and 
resulting GLEAN sets ranged in size from in 10,403 
(with three input gene sets) to 17,724 genes (with seven 
input gene sets), (Additional file 2). In some cases, 
GLEAN predicted a larger number of genes than the 
MAKER2 set supported by either biological sequence 
alignment or InterPro. Although GLEAN generates only 
one transcript per gene locus, we decided to use GLEAN 
to create the OGS, because we prioritized GLEAN's 
potential identification of a more comprehensive set 
of gene loci over MAKER2's identification of 307 



alternative transcripts. Since with GLEAN we did not 
need to filter ab initio genes based on InterPro, it might 
allow the genome-conservation-based N-SCAN dataset 
to provide support for other ab initio predictions even 
without detectable conservation with known InterPro 
domains. Another consideration was that assembly 
methods for lUumina RNASeq were not well established. 
We were concerned about the possibility of merging 
genes or missing introns due to potential genomic con- 
tamination in RNASeq experiments. GLEAN uses tran- 
script alignments to support splice predictions, but does 
not merge or split genes based on transcript alignments, 
while MAKER2 creates gene models that best agree 
with transcript alignments, and works best with highly 
reliable transcript assemblies. 

Selecting an official gene set 

We evaluated the 32 GLEAN sets based on several cri- 
teria, including overlap with a conservative evidence- 
based set (RefSeq), transcript sequences, peptides and the 
CEGMA [30] conserved core set (Additional file 2). No 
single gene set was optimal with respect to all criteria. We 
chose to rank sets based on number of peptide matches, 
which would prioritize completeness of a protein-coding 
gene set rather than correctness of gene structure. 

Assessing the new official gene set 

The selected GLEAN set, OGSv3.2 (GLEAN31 in Additional 
file 2), represented a significant improvement because 
it included 15,314 protein-coding genes, which is 5,157 
more genes than the first official gene set, OGSvl.O. 
The proportion of genes on placed scaffolds as well as 
those with expressed sequence coverage also increased 
over OGSvl.O (Table 4). Since GLEAN predicts only 
coding exons, but not untranslated regions (UTRs), we 
used MAKER2 [28] to add UTR to the final GLEAN 
gene predictions. Out of a total of 15,314 OGSv3.2 
genes, UTR were added to 7,514 genes (49%). 

Many split and merged gene models were noted when 
comparing the 32 GLEAN sets, including OGSv3.2, to the 
conservative RefSeq gene set (Additional file 2). Since it is 
difficult to computationally resolve disagreements among 
gene sets, the OGSv3.2 genes in question are provided as a 
separate track on the BeeBase Amel_4.5 genome browser 
[31]. Web Apollo [32] annotation tools allow registered 
BeeBase users to modify gene models, and can be used to 
manually correct split or merged genes [33]. 

To address the question of whether the increased gene 
number was due to splitting genes, we looked at the 
total number of coding nucleotides (nt) and the distri- 
bution of coding sequence lengths in OGSvl.O and 
OGSv3.2. The ranges in coding sequence lengths were 
24-53,649 and 75-70,263 nt for OGSvl.O and OGSv3.2, 
respectively. Note that the gene prediction parameters 
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Table 4 Comparison of OGSvLO and OGSv3.2 





OGSvl.O 


OGSv3.2 


Number of genes 


10,157 


15,314 


Number of genes within mapped scaffolds (% of total no. of genes) 


5,973 (58.8%) 


1 3,285 (86.8%)) 


Number of genes within un-mapped scaffolds {% of tota no of genes) 


4 J 84 (41.2%) 


2,029 (13.2%)) 


Average coding sequence length (bp) 


1,623 


1,266 


Average number of coding exons 


64 


5.3 


Number of single coding exon genes (% of total no. of genes) 


795 (7.8%) 


2,059 (13.4%)) 


Number of multi-coding exon genes (% of total no. of genes) 


9,362 (92.2%) 


1 3,255 (86.6%)) 


Number of genes with spliced EST coverage {% of total no. of genes) 


3,039 (29.9%) 


12,172 (79.5%)) 


Number of genes with un-spliced EST coverage (% of total no. of genes) 


1,734 (17.1%)) 


11,019 (72%) 


Number of genes that overlap a protein alignment (% of total no. of genes) 


7,940 (78.2%)) 


6,778 (44.3%)) 



did not allow coding sequences less than 75 nt for 
OGSv3.2, but a minimum coding sequence parameter 
was not applied in the generation of OGSvl.O; however 
there were only 6 OGSvl.O genes less than 75 nt. The 
OGSv3.2 gene with the largest coding sequence over- 
lapped a single RefSeq gene prediction (XM_623650.3), 
and was found to be homologous to genes predicted to 
code for titin. The total number of coding nucleotides 
increased from 16,484,776 in OGSvl.O to 19,342,383 in 
OGSv3.2. The increase in coding nucleotides by only 
2,857,607 (17.3%) compared to a 50.8% increase in the 
number of genes suggested that new genes were shorter 
and possibly a result of spUtting OGSvl.O gene models. 
However, the distribution of coding sequence length in 
the two gene sets suggested that splitting genes was not 
the major source of the increased gene number (Figure 3). 
Although OGSv3.2 possessed a large number of short 
coding sequences (< -1.5 kb) compared to OGSvl.O, the 
larger genes did not appear to be missing. Furthermore, 
analyzing overlapping gene alignments (described below), 
showed that 4,735 of the OGSv3.2 genes did not overlap 
with genes in OGSvl.O, and thus were not likely to be a 
result of splitting OGSvl.O genes. 

Biological evidence for the new official gene set 

In assessing the gene set, we define biological evidence 
to include biological sequence alignments used in gene 
prediction pipelines (transcript, peptide, protein homo- 
log) as well as matches to InterPro domains and align- 
ment to other bee genomes. We define biological gene 
evidence as all of the datasets considered as biological 
evidence, except alignment to other bee genomes, since 
we did not use information about genes or expression 
for the other bee genomes. Some form of biological gene 
evidence supported the majority (14,084, 92%) of the 
OGSv3.2 genes. An additional 754 (4.9%) OGSv3.2 genes 
aligned to either the Apis florea or Bombus terrestris 



genome for a total of 14,836 (96.9%) of the OGSv3.2 
genes with biological support. Requiring that transcripts 
be spliced reduces the number of supported genes to 
13,264 (86.8%) or 14,661 (95.7%) if alignment of the 
gene to another bee genome is considered as support. 

Identification and characterization of New genes 
Identifying new genes due to improved assembly or 
improved gene prediction 

In order to compare the old and new OGS (OGSvl.O 
and OGSv3.2), we mapped OGSv3.2 coding sequences 
to the Amel_2.0 assembly because it was the assembly 
used to predict OGSvl.O genes [1], which were later 
mapped to Amel_4.0. We chose to use Amel_2.0 in this 
step, because generation of Amel_4.0 and intermediate 
assemblies had involved rearranging and splitting scaf- 
folds and incorporating unscaffolded contigs using an 
updated genetic map [1]. We wished to minimize map- 
ping errors in our comparison of gene sets by using 
Amel_2.0 to determine which OGSv3.2 genes corres- 
pond to OGSvl.O genes. 

We performed the mapping twice using different 
alignment criteria, stringent (95% identity and 80% align- 
ment coverage) and relaxed (95% identity and 50% align- 
ment coverage). We will report results only for the 
stringent criteria (Table 5) unless otherwise specified, 
but have provided the relaxed mapping results in Table 
S2 in Additional file 1. On the basis of mapping to 
Amel_2.0 and overlap between the OGSv3.2 coding se- 
quence alignments with OGSvl.O gene models, we di- 
vided the 15,314 OGSv3.2 genes into three sub-sets 
(Table 5). Any OGSv3.2 gene that did not align to the 
Amel_2.0 assembly we deemed a "Type I New" gene. 
Any OGSv3.2 gene that aligned to the Amel_2.0 assem- 
bly, but whose coordinates did not overlap an OGSvl.O 
gene by a single coding base pair on the same strand, we 
deemed a "Type II New" gene. Finally, any OGSv3.2 
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Coding Sequence Length 

Figure 3 Distribution of coding sequence lengths in OGSv3.2 and OGSvl .0. Histogram plots showing tine number of genes having "X " 
coding sequence length in bins of 20 nt are illustrated using points instead of lines to allow visualization of both distributions. The range in 
coding sequence length extends to 70,263 and 53,649 in OGSv3.2 (blue) and OGSvl .0 (red), respectively, but this figure zooms in to show lengths 
only up to 5,000 nt. There were 386 and 344 genes with coding sequences longer than 5,000 nt in OGSv3.2 and OGSvl .0, respectively. This fgure 
shows that the increased number of genes in OGSv3.2 is largely due to increased numbers of short genes. The number of larger genes is not 
decreased, so gene splitting is not likely a major source of additional genes. 



gene that both aUgned to the Amel_2.0 assembly and 
overlapped an OGSvl.O gene we deemed a "Previously 
Known" gene. 

Of the 5,157 additional genes in OGSv3.2 compared to 
OGSvl.O, 4,735 genes did not overlap OGSvl.O genes. 
The other 422 additional genes in OGSv3.2 were due to 
either splitting OGSvl.O genes (discussed above) or to 
the annotation of additional recent paralogs in OGSv3.2, 
and are included in the total count of 10579 Previously 
Known genes. The 4735 genes that did not overlap with 
OGSvl.O genes could be classified as 782 genes discov- 
ered due to the additional sequencing and reassembly of 
the bee genome for the Amel_4.5 assembly (Type I New 
genes; Table 5) and 3,953 genes detected by improved 
gene prediction, either through the use of new expressed 
sequence and protein data or improved gene prediction 
algorithms (Type II New genes; Table 5). We could map 
405 additional Type I New genes to the Amel_2.0 as- 
sembly if we required only 50% of the gene to be cov- 
ered (Table S2 in Additional file 1), rather than the more 
stringent 80% gene coverage reported in Table 5, con- 
sistent with the Amel_2.0 assembly being less continu- 
ous than Amel_4.5. This lack of contiguity and resulting 
fragmentation of genes in the Amel_2.0 assembly likely 



impaired the initial gene prediction efforts. While the 
frequency of genes with some form of biological support 
(transcript, peptide, protein homolog, InterPro domain, 
and/or alignment to another bee genome) was highest 
for Previously Known genes (99.7%), most of the Type I 
New genes (93.8%) and Type II New genes (89.9%) were 
also supported (Table 5). 

Characteristics of new genes 

We analyzed the data that went into the annotation of 
OGSv3.2 and evaluated which pieces of evidence con- 
tributed to the prediction of the genes to understand 
why they were missed in OGSvl.O. To determine 
whether genes that were not detected in OGSvl.O have 
common characteristics that make them more challen- 
ging to predict, we compared them to the Previously 
Known genes. We evaluated features such as tissue ex- 
pression specificity, coding feature length, GC content, 
overlap of protein homolog alignments on Amel_4.5 and 
the proportion of non-canonical splice sites. Additional 
file 3 provides sources of evidence for each OGSv3.2 
gene. 

The mean coding sequence lengths of Type I New 
(1,172 bp) and Type II New genes (331 bp) were shorter 
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Scaffold analysis 



CDS analysis 



Intron analysis 

Peptide analysis 
Protein analysis 



Total spliced and un-spliced expressed 
sequence support 

Spliced expressed sequence analysis 



Number of genes (% of total OGSv3.2 genes) 

Number of genes within mapped scaffolds (% of no. of gene type) 

Number of genes within un-mapped scaffolds (% of no. of gene type) 

Average CDS length 
Average no. CDS Exons 
Number of single CDS exon genes (% of no. of gene type) 

Number of multi-CDS exon genes (% of no. of gene type) 

Number of introns (% of total OGSv3.2 introns) 

Number of introns validated by EST intron coordinates (% of introns of gene type) 

Number of genes with a peptide match (% of no. of gene type) 

No. of genes with overlap to at least one protein alignment (% of no. of gene type) 

No. of genes with overlap to a Dmel protein alignment (% of no. of gene type) 

No. of genes with overlap to at least one transcript alignment from any of the ten libraries 

(% of no. of gene type) 

No. of genes with overlap to at least one spliced transcript alignment from each of the ten 
ibraries (% of no. of gene type) 

No. of genes with overlap to at least one spliced transcript alignment from any of the ten 
ibraries (% of no. of gene type) 

No. of genes without overlap to any spliced transcript alignments in any of the ten libraries 

(% of no. of gene type) 

Genes broadly expressed across four tissues (% of no. of gene type) 

Genes narrowly expressed in only a single tissue (% of no. of gene type) 

No. of genes without overlap to any spliced transcript alignments in any of the four tissues 

(% of no. of gene type) 



15,314 
(100%) 

1 3,285 
(86.8%) 

2,029 
(1 3.2%) 

1,266 

5.3 

2,059 
(1 3.4%) 

13,255 
(86.6%) 

66,212 
(100%) 

54,514 
(82.3%) 

3,631 
(23.7%) 

6,778 
(44.3%) 

1,205 
(7.9%) 

13,517 



1,062 
(6.9%) 

12,172 
(79.5%) 

3,142 
(20.5%) 

2,326 
(1 5.2%) 

3,346 
(21.8%) 

3,632 
(23.7%) 



782 (5.1%) 

544 (69.6%) 

238 (30.4%) 

1,172 
5.6 

101 (12.9%) 
681 (87.1%) 
3,585 (5.4%) 
2,573 (71.8%) 
132 (16.9%) 
270 (345%) 

38 (4.9%) 
704 (90.0%) 

32 (4.1%) 
622 (79.5%) 
1 60 (20.5%) 

60 (7.7%) 
234 (29.9%) 
1 92 (24.6%) 



3,953 (25.8%) 

3,199 (80.9%) 

754 (19.1%) 

330 
2.1 

1,239 (31.3%) 
2,714 (68.7%) 
4,333 (6.5%) 
1,930 (44.5%) 
82 (2.1%) 
1 86 (4.7%) 
1 3 (0.3%) 
2,771 (70.1%) 

1 5 (04%) 
2,110 (53.4%) 
1,843 (46.6%) 

95 (24%) 
1,139 (28.8%) 
1,985 (50.2%) 



10,579 (69.1%) 

9,542 (90.2%) 

1,037 (9.8%) 

1,622 
6.5 
719 (6.8%) 

9,860 (93.2%) 

58,294 (88%) 

50,011 (85.8%) 

3,417 (32.3%) 

6,322 (59.8%) 

1,154 (10.9%) 

1 0,042 (94.9%) 

1,015 (9.6%) 

9,440 (89.2%) 

1,139 (10.8%) 

2,171 (20.5%) 

1 ,973 (1 8.7%) 

1,455 (13.8%) 
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Table 5 New and previously known OGSv3.2 genes (Continued) 



Analysis of alignments to other bee 
genomes 



Evidence-supported genes 



GC analysis 



ENC analysis 



No. of genes that align to Aflo_1.0 (% of no. of gene type) 

No. of genes that align to Bter_1 .0 (% of no. of gene type) 

No. of genes with overlap to at least one form of biological evidence (% of no. of gene type) 

No. of genes that align to Aflo_1.0 and/or Bter_1.0 and/or overlap at least one form of 
biological evidence (% of no. of gene type) 

Number of genes on GC compositional domains >10 kb (% of OGSv3.2 total) 

Avg. GC content of compositional domain gene resides in 
Effective number of codons 



13,491 566(72.4%) 2,584(65.4%) 
(88.1%) 

1 2,262 527 (67.4%) 1 ,566 (39.6%) 
(80.1%) 



14,084 
(92.0%) 



29.60% 
44.95 



713 (91.2 



14,836 734(93.9 
(96.9%) 

15,224 777(5.19 



2640% 
41.97 



2,930 (74.1%) 

3,555 (89.9%) 

3,923 (25.8%) 

32.00% 
45.69 



10,341 (97.8 

10,169 (96.1 

10,441 (98.7 

1 0,547 (99.7 

10,524 (69.1 

28.90% 
44.9 



Genes were mapped to Amel_2.0 assembly with stringent mapping criteria of 80% gene coverage and 95% identity. Biological evidence includes transcript overlap (spliced or un-spliced), peptide hit, protein homolog 
alignment overlap, or InterPro domain presence. 
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than that of Previously Known genes (1,623 bp) (P = 2.3 x 
10'^^ and P < 2.2 x 10'^^ respectively) (Table 5). This dif- 
ference may be due to a higher fraction of single coding 
exon genes among new genes. Thirteen percent of Type I 
New genes and 31% of Type II New genes contained one 
coding exon, while only 6.8% of Previously Known genes 
contained one coding exon (P < 2.805 x 10' and P < 2.2 x 
10'^^ for Type I and Type II genes, respectively) (Table 5). 
The number of canonical versus non-canonical splice sites 
was not significantly different between the Previously 
Known and Type II New genes (Table S3 in Additional 
filel). 

Type I New genes were found to reside in GC com- 
positional domains with mean 26.4% GC, significantly 
lower than that of Previously Known genes (28.9%) 
(P = 2.188 X 10' ), supporting improvement in the as- 
sembly of the high-AT-content regions. On the other 
hand, Type II New genes were found in GC compos- 
itional domains with a mean GC content of 32.0%, 
higher than that of than Previously Known genes 
(P < 2.2 X 10'^*^), but still slightly lower than the mean 
GC content of the genome (32.7%) 

The effective number of codons is an estimate of the 
deviation from equality of synonymous codon usage of 
all codons of a gene, and ranges from 20 (extreme bias 
where only one codon is used for each amino acid) to 61 
(no bias, all synonymous codons are used equally) [34]. 
Type I New genes had a significantly lower mean effect- 
ive number of codons than Previously Known genes, 
41.97 vs. 44.90 respectively (P = 2.26 x 10'^^) (Table 5). 
This is consistent with the idea that the more extreme 
the deviation from equal proportions of G + C and A + T 
nucleotides in coding sequences, the lower the potential 
diversity of synonymous codons. Consistent with their 
locations in less extreme GC compositional domains. 
Type II New genes had a significantly higher mean ENC 
than the Previously Known genes, 45.69 vs. 44.90 re- 
spectively (P = 3.923 X 10'°^) (Table 5). 

Expression evidence for new genes 

The majority of OGSv3.2 genes were supported by tran- 
script evidence. When combined, the spliced and un- 
spliced transcript alignments overlapped 13,517 (88.3%) 
of OGSv3.2 genes. An analysis of OGSv3.2 gene cover- 
age by transcript dataset (Table S4 in Additional file 1) 
showed that the fraction of genes represented in a tran- 
script dataset ranged from 28.3% (both larvae and testes) 
to 79.2% (forager brain) (Table S4 in Additional file 1). 
Both Type I New (79.5%, 622) and Type II New (53.4%, 
2110) genes were less likely to overlap a spliced tran- 
script alignment than Previously Known genes (89.2%, 
9,440) (P = 3.298 x 10'^*^ and P < 2.2 x lO'^'', respectively) 
(Table 5). 



Compared to Previously Known genes both Type I 
New and Type II New genes were more likely to be nar- 
rowly expressed or overlap transcript alignments from 
only one tissue (P = 2.136 x 10'^* and P < 2.2 x lO'^*"', re- 
spectively) 18.7% of Previously Known, 29.9% of Type I 
New and 28.8% of Type II New were narrowly expressed 
(Table 5). Conversely, Previously Known genes (20.5%, 
2,171) were more likely to be broadly expressed or over- 
lap transcript alignments from all tissues than either 
Type I New or Type II New genes (P < 2.2 x lO''^^ for 
both tests). 20.5% of Previously Known, 7.7% of Type I 
New and 2.4% of Type II New genes overlapped tran- 
script alignments from all tissues. However, the fractions 
of narrowly expressed (18.7%) and broadly expressed 
(20.5%) genes were similar to each other for Previously 
Known genes. 

Homolog alignment evidence for new genes 

Results suggested that the use of dipteran proteins as 
the main source of protein homolog evidence for 
OGSvl.O may have been a contributing factor to an in- 
complete gene set; 44.3% of the OGSv3.2 genes over- 
lapped at least one protein homolog aUgnment, but only 
7.9% overlapped an alignment of a D. melanogaster pro- 
tein (Table 5). Differences among gene sets further sup- 
ports the limited value of Dipteran proteins as a 
primary source of homolog evidence for A. mellifera 
gene prediction. Both Type I New and Type II New 
genes were less likely than Previously Known genes to 
overlap a D. melanogaster homolog alignment (P = 
1.394 X 10'"^ and P < 2.2 x 10'^*^, respectively). The ef- 
fect was more drastic for Type II New genes; only 0.3% 
of Type II New genes overlapped a D. melanogaster 
homolog alignment, while the proportions were 4.9% 
and 10.9% for Type I New and Previously Known genes, 
respectively. 

Both Type I New and Type II New genes were less 
likely than Previously Known genes to overlap any 
homolog alignment from another sequenced genome 
(P < 2.2 X lO'^'' for both tests), but the difference was 
more extreme for Type II New genes, potentially imply- 
ing that a greater number of the Type II New genes are 
specific to either A. mellifera, bees, or Hymenoptera 
than the Previously Known genes (Table 5). 59.8% of 
Previously Known genes overlapped a homolog align- 
ment, while only 34.5% and 4.7% of Type I and Type II 
new genes, respectively, overlapped a homolog alignment 
(Table 5). 

Genome alignment evidence for new genes 

Over 80% of the OGSv3.2 genes aligned to both the 
Aflo_1.0 and Bter l.O genome assemblies (Table 5). 
A notable difference between Type II New genes and the 
other gene sets was the proportion of genes that were 
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supported by genome conservation, but not other sources 
of evidence. Of the 3,555 Type II New genes supported 
by any evidence, 17.5% were not supported by other 
sources. On the other hand, only 2.9% of supported Type 
I New genes and 1% of supported Previously Known 
genes were supported by only genome conservation. 

The remaining 20% of the OGSv3.2 genes have the 
potential to be A^/s-specific (8% of genes that align to 
Aflo_1.0, but not Bter_1.0) or A. mellifera specific (12% of 
OGSv3.2 genes that do not align to Aflo_1.0) (Table 5). 

Peptide analysis 

Use of peptide evidence 

Peptide data were used in four ways. First, peptide data 
were used in gene prediction by AUGUSTUS, the only 
program used in this study that accepts this source of 
gene evidence. Second, peptides were compared to all 
consensus gene prediction sets to identify the set with 
the highest representation of peptides (described above; 
Additional file 2). Third, peptide support was used in 
characterizing Previously Known, Type I and Type II 
New genes. Fourth, venom peptides were used to manu- 
ally annotate venom genes associated with the sting of 
the honey bee. 

Gene set comparison with all peptides 

Peptide sequences aligned to a greater number of Previ- 
ously Known genes than to Type I or Type II New genes 
(P < 2.2 X 10 ^ for both tests). Peptides aligned to 32% 
of the Previously Known genes, but to only 17% of Type 
I and 2% of Type II New genes (Table 5 and Additional 
file 3). 

Venom peptide analysis 

Despite the lower proportion of new genes compared to 
Previously Known genes with peptide matches, analysis 
of venom peptides (a subset of the peptide data) showed 
that the Amel_4.5 assembly provides a significant contri- 
bution to venom proteome research. 705 unique venom 
peptides provided biological evidence for 102 genes 
(described in [35]). Searching the venom mass spectra 
against OGSvl.O and OGSv3.2 showed that the improved 
assembly allowed detection of 21 additional peptides sup- 
porting 9 new venom protein identifications in OGSv3.2 
(Additional file 4). Additional tryptic peptides were dis- 
covered for 7 venom proteins as a result of improved 
gene predictions (Additional file 4). 

Manual annotation of genes supported by the venom pep- 
tides (Additional file 5) using Apollo [36] showed that most 
honey bee venom genes are fully (76.5%) or partially (19.6%) 
covered by transcriptome evidence. The putative biological 
functions of these genes are described elsewhere [35]. We 
discovered that one of the annotated genes (GB40695) codes 
for tertiapin. WhUe the tertiapin peptide was already found 



in bee venom [37], no genomic or transcriptomic evidence 
had been described, an issue which is now solved as the 
genome improvement project supplies both a gene predic- 
tion (GB40695, NCBI Gene ID 100576769) and EST evi- 
dence (GenbarLk:HP466647.1). The gene is positioned on 
linkage group 12, next to the apamin and mast cell de- 
granulating peptide venom genes. The three genes are 
tandemly arranged which suggest their origin by gene du- 
plication via unequal crossing over, and may also point to 
a joint control of transcription [38]. 

Orthology assessment 

Analysis of two different A. mellifera gene sets in com- 
parison to genes from other insects allowed us to investi- 
gate the impact of the new genome and gene annotations 
on the numbers of A. mellifera genes in near-universal 
orthologous groups. Although true gene losses can and 
do occur, these near-universal insect orthologous groups 
highlight possible missed gene annotations in each 
species. 

Numbers of orthologs were counted for each of the two 
sets, V2 (similar to OGSvl.O, described in Methods) and 
OGSv3.2. Figure 4 and Table S5 in Additional file 1 show 
the near-universal orthologs missing in each genome. The 
pea aphid [Acyrthosiphon pisum) was found to be miss- 
ing more orthologous groups than the other insects. The 
A. mellifera V2 annotation was average for the genomes 
analyzed (missing 263 orthologs), but the OGSv3.2 anno- 
tation was much more complete, missing fewer orthologs 
(112) than the other genomes. Thus, the new gene set 
OGSv3.2 reduced the total number of potentially missing 
orthologs in A. mellifera by 57%, demonstrating the annota- 
tion improvement that recovered more "universal" insect 
orthologs than the previous one did. 

Predicted gene functions 
Gene ontology analysis 

Gene Ontology analysis (Additional file 6) showed enrich- 
ment of some specific functions in new genes. Type I New 
genes were enriched for the biological processes "apop- 
tosis" (corrected e-score = .0016), "neurotransmitter re- 
lease" (corrected e-score = .0025), and "secondary active 
organic cation transmembrane transporter" (corrected 
e-score < .0001). The detection of these Type I New genes 
due to new assembly data is likely due to their location in 
low GC content regions, which were underrepresented in 
the older assembly; this is in agreement with the lower 
mean GC content of GC compositional domains contain- 
ing Type I New genes. Type II new genes were enriched 
for the molecular function "nuclease activity" (corrected 
e-score = .011). Rapidly evolving and single exon genes are 
more difficult to annotate automatically, so it is perhaps 
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Figure 4 Insect orthologs in two A mellifera gene sets (V2 and OGSv3.2). For each species, counts of near-universal ortliologous groups tliat 
are missing an ortholog in tliat species, or in that species and one other species, are shown. Total counts are divided into groups with only 
single-copy orthologs and those with gene duplications, further divided into those with only one missing species and those with two 
missing species. 



not surprising that we found significant differences in 
some activities. 

InterPro analysis 

Comparison of InterPro protein domains found in 
OGSv3.2 and OGSvl.O proteins showed that 269 of the 
InterPro domains present in OGSv3.2 were not present 
in OGSvl.O (Additional file 7). There were also signifi- 
cant differences in the fraction of genes with the 
domains IPR004117 (Olfactory receptor, Drosophila, 
p<.02) and IPR001888 (Transposase, type 1; p < .01). 
There were 141 and 92 genes with olfactory receptor do- 
mains in OGSv3.2 and OGSvl.O, respectively. Olfactory 
receptor genes contain a single olfactory receptor do- 
main, so the domain count corresponds to the gene 
count. The additional olfactory receptor genes were 
found among both the Type I and Type II New genes. 
The family of olfactory receptor genes is expanded in 
hymenopteran insects, with rapidly diverging family 
members that are often arranged in tandem arrays 
[20,21,39]. Many of the 166 known A. mellifera olfactory 
receptor genes were identified by manual annotation of 
the previous assembly, because they were not found in 
OGSvl.O [20,21,39]. Improved computational prediction 



of olfactory receptors in the new assembly was likely due 
to increased assembly continuity which would improve 
identification of tandemly repeated genes, and new 
sources of biological sequence alignment evidence (e.g. 
transcript and protein homolog sequences) which im- 
prove identification of rapidly diverging genes. 

There were 18 and 3 genes with transposase type 1 
domains in OGSv3.2 and OGSvl.O, respectively. Thir- 
teen of the additional transposase type Idomains were in 
Type 1 New genes. Transposases are associated with 
transposable elements, which are a class of interspersed 
repeats. Since the new assembly information filled in 
gaps, and gaps in de novo assemblies are found where 
the assembly process fails to find an unambiguous path 
(e.g. repetitive loci), we were not surprised to identify 
InterPro domains associated with repetitive sequences 
among the Type I New genes. Predicted genes with 
matches to transposable element domains are often re- 
moved from gene sets. We further investigated OGSv3.2 
for presence of interspersed repeats, and decided not to 
remove the small number of genes associated with trans- 
posable element domains, because evidence suggested 
that some were real host genes (see Genome-wide repeat 
analysis). 
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Genome-wide repeat analysis 

Processing the Amel_4.5 assembly with the REPET pipe- 
Hne yielded 2,401 de novo predicted repetitive elements, 
of which 1,045 were validated by annotation of at least 
one complete copy. In total 9.46% (22.13 Mb) of the gen- 
ome appears to be repetitive (Table 6). Non-interspersed re- 
peats (SSR, low complexity, satellite) accounted for 4.05% 
(9.49 Mb), whereas interspersed repeats represented 5.07% 
(12.69 Mb) of the Amel_4-5 assembly. The latter estimate 
for transposable elements is similar to a previous estimate 
of 3% [1], whereby only Mariner and R2 elements were re- 
ported. Thus the honey bee remains a species with an un- 
usually low amount of repetitive DNA. 

Most of the groups of retrotransposable elements 
were detected in the genome of the honey bee. In com- 
parison to many other organisms, the most striking dif- 
ference is the extremely low diversity and abundance of 
these elements. LTR retrotransposons accounted for 
only 0.02% of the genome (49.6 kb) and included exam- 
ples from only one Copia and a putative, unclassified 
element. Only fragments of elements from the BelPao 
and Gypsy superfamily were found. The DIRS group 
was represented by two incomplete elements and 
accounted for 0.01% of the genome (12.5 kb). LINEs 
accounted for only 0.04% of the genome (83.1 kb), and 
were represented by only two R2 elements, one / (nim- 
bus) element, and a few fragments, potentially belonging 
to / (Rl) and Jockey [CRl). Of the SINE elements de- 
tected, 14 could not be classified and five had similar- 
ities to SINEs of the 5S type, all representing 0.03% of 
the genome (70 kb). Together with another unclassified 
element, all class I retroelements summed up to only 
224 kb (0.09%) of the genome, among the lowest in the 
animal kingdom. TRIMs (terminal repeat retrotranspo- 
sons in miniature) [40] and LARDs (large retrotrans- 
poson derivatives) [41] are derivatives of retroelements 
and were detected in larger number occupying 9.57 Mb 
(4.09%) of the genome (Table 6). 

Class II DNA transposons were more frequent and 
accounted for 0.57% of the genome (1.34 Mb). The vast 
majority of the elements were TIR (terminal inverted re- 
peat) transposons of the Mariner superfamily (0.49%, 
1.15 Mb). Otherwise only two elements of the Piggy Bac 
superfamily (0.04%, 88 kb) and five unclassified TIR ele- 
ments could be detected. Other types of DNA transpo- 
sons, Crypton, Maverick and Helitron could not be 
found. The DNA transposon derivatives, MITES (mini- 
ature inverted transposable elements), were found only 
once and accounted for less than 0.01% of the genome 
(3.8 kb) (Table 6). 

Besides the well-classified sequences, many repetitive 
elements could not be assigned to a superfamily or even 
class. The latter includes a larger number of elements 
(0.65%, 1.52 Mb), which could represent novel types. 



but need further investigation. We separately annotated 
and excluded from the transposable element counts ele- 
ments that could not be categorized and elements with- 
out typical transposable element features that contained 
profiles from protein coding genes. Both together com- 
prise 2.96% (6.93 Mb) of the genome (Table 6). 

The detected elements of the R2 and the majority of 
those of the Mariner type belonged to or were very similar 
to previously described elements from A. mellifera or other 
insects (RepBase vl7.01, [42]). A few other Mariner and the 
PiggyBac elements showed similarities to elements known 
from distant animal species. This indicates a potential hori- 
zontal gene transfer as previously suggested [43] . 

Most of the elements appeared to be fragmented and 
incomplete. Although some contained sequences of typ- 
ical transposable element protein domains, they seemed 
to be inactive due to stop codons and frameshift muta- 
tions. We detected only four retrotransposons and 27 
DNA transposons with RT (reverse transcriptase) or 
Tase (transposase) domains, respectively. None of the 
retrotransposons appeared to possess an active ORE 
containing an entire RT domain, so we classified them 
as inactive. Among the DNA transposons, six of the 
Mariners appeared to be complete and two were poten- 
tially active. Five additional Mariner elements possessed 
an intact ORF spanning at least parts of a Tase domain, 
so they might have limited activity. The higher abun- 
dance and higher number of chimeric inserts of Mari- 
ners (Table 6) suggests that transposons were more 
recently active than retrotransposons. 

Repeats associated with the official gene set 

Since gene annotation had been performed on an assembly 
that was masked for repeats early in the project, prior to 
the availability of results from the genome-wide repeat ana- 
lysis, we expected some OGS genes to overlap newly- 
detected repeats due to incomplete masking. Coding se- 
quences of 1,234 genes overlapped interspersed repeats de- 
tected by REPET. These genes were further investigated for 
characteristics that would support their annotation as host 
genes, including whether the gene was Previously Known, 
had multiple coding exons, overlapped a spliced transcript 
alignment, and had InterPro domain matches (Additional 
file 8). Of these 1,234 REPET-overlapping genes, 739 were 
classified as Previously Known, 185 as Type I New genes, 
and 310 as Type II New genes. 1,040 of the REPET- 
overlapping genes had multiple exons and 972 overlapped 
spliced transcript alignments. 

Inspecting InterPro results to identify protein domains 
known to be associated with transposable elements, but 
not host genes, resulted in the following list of domains: 
DDE superfamily endonuclease; Integrase, catalytic core; 
Reverse transcriptase, RNA-dependent DNA polymerase; 
Retrotransposon, Pao; Ribonuclease H domain; Ribonuclease 
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H-like domain; Transposase, Tel -like; Transposase, type 1; 
Transposase, Synechoq'stis PCC 6803. Of the 760 REPET- 
overlapping genes that had InterPro matches, only 35 
matched one of these transposable element domains. Some 
protein domains, such as zinc fingers, peptidases and heli- 
cases, are similar between host genes and transposable ele- 
ments, so cannot be used to classify genes as transposable 
elements. Among the genes that matched non-transposable 
element domains were some known to be members of large 
gene families, such as olfactory receptors, and genes with re- 
petitive domains, such as ankyrin repeat-containing domains. 
We were not surprised that these would be included in a set 
of de novo detected repeats. 

Of the 35 genes with transposable element domain 
matches, 13 had other domains suggesting they could be 
host genes, and 28 overlapped spliced transcript align- 
ments. Only four genes with either transposable element 
or unknown/uncharacterized domain family matches 
lacked evidence of being a host gene. That is these four 
genes had a single coding exon, lacked spliced transcript 
overlap and had no other InterPro domain match. We 
chose not to remove from the OGS genes that overlapped 
repeats detected by REPET or genes that matched Inter- 
Pro transposable element domains, because of evidence 
supporting a large number of them as host genes, and 
the possibility that transposable elements may have 
contributed to the evolution of some host genes (e.g. 
reviewed in [44]). 

Discussion 

In the approximately seven years that have elapsed be- 
tween generation of OGSvl.O and OGSv3.2, new se- 
quence was added to the assembly; new sequencing 
technologies, genome assembly methods and gene pre- 
diction methods were applied; and new sources of gene 
prediction evidence became available. Several lines of evi- 
dence indicate that the new assembly is more complete. 
These include improved continuity of scaffolds, increased 
coverage of low GC content regions of the genome, iden- 
tification of 782 new genes that do not align to the older 
assembly, and increased number of detectable repetitive 
elements. The upgraded assembly allowed us to conduct 
a comprehensive analysis of repetitive elements in the 
genome, create an improved gene set and confirm previ- 
ous findings related to GC composition. It also allowed 
us to confirm that low amounts of transposable elements 
and repetitive DNA are bona fide features of the honey 
bee genome, and not artifacts of incomplete assembly 
and annotation. 

Transposable elements can be a major factor in genome 
and gene evolution. Previous analyses found a low number 
of transposons and retrotransposons in the assembled 
genome compared to other sequenced insect genomes [1], 



but researchers questioned whether additional elements 
were present in unassembled portions of the genome. 
The upgraded assembly allowed us to better characterize 
repetitive elements in the genome. Despite a more com- 
prehensive repetitive element annotation, the genomic 
coverage of transposable elements was extremely low, 
most striking for the retrotransposons, in agreement with 
the previous analyses. The most apparent difference be- 
tween A. mellifera and other hymenopteran insects is the 
relative lack of retrotransposable elements; genomes of 
other bee, wasp, ant and insect species all contain higher 
proportions [22,23,45]. Although comparisons across 
studies are difficult due to methodological differences, our 
results show that the total fraction of repetitive DNA in 
the A. mellifera genome (9.46%) is lower than that of most 
other sequenced hymenoptera genomes. Some of the 
ant genomes have more than twice as much repetitive 
DNA {Atta cephalotes, 25%, Linepithema humile 23.5%, 
Acromyrmex echinator 27%, Harpegnathos saltator 27%, 
Solenopsis invicta >23%) and the genome of the parasitoid 
wasp, Nasonia vitripennis, contains more than three times 
the amount (>30%) [18-20,22-24]. The ants Camponotus 
floridanus and Pogonomyrmex barbatus are more similar 
to A. mellifera, with estimates of 15% and 9%, respectively 
[18,21]. However, those analyses did not include LARD, 
TRIM and MITE elements, which make up a considerable 
fraction of the repetitive elements in A. mellifera (4.1% of 
the genome; 43.3% of the repetitive DNA). Without these 
derivative elements, A. mellifera would possess well less 
than 1% transposable elements. This extraordinarily low 
proportion of mobile elements suggests that evolutionary 
processes molding the A. mellifera genome differ from 
processes working on other hymenopteran genomes, even 
though many of the species listed above are also insects 
with eusocial lifestyles. 

Combined with new biological evidence for gene predic- 
tions, the upgraded assembly allowed for significant im- 
provement to the A. mellifera gene set, with >50% more 
genes. The identification of 269 additional InterPro do- 
mains and the 57% reduction in number of missing univer- 
sal insect orthologs indicate a more comprehensive catalog 
of protein functions. The presence of nine new venom pro- 
tein genes, the detection of new tryptic peptides in existing 
venom protein genes, and the 53% increase in the number 
of computationally identified olfactory receptor domains 
are examples of more comprehensive annotation of specific 
gene families important to bee biology enabled by the im- 
proved genome. 

The previously described A. mellifera genome charac- 
teristics of low and heterogeneous GC content [1] remain 
after the addition of new sequence to the assembly. Ex- 
pansion and improvement of the low GC content regions 
in the new Amel_4-5 assembly was supported by the 
identification of Type I New genes, which were found in 
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regions of lower GC content than that of Previously 
Known genes. While it is impossible to know whether 
new gene prediction evidence (transcript, peptide, 
homolog, genome conservation) affected the ability to 
predict the Type I New genes, the identification of 
Type II New genes in regions with slightly higher GC 
content than Previously Known genes suggested that 
the addition of new gene evidence had a greater impact 
on gene prediction in higher GC regions. Although any 
single evidence type was less frequent in the Type II 
New genes, the total number of Type II New genes 
supported by the evidence (3,555) was higher than that 
of Type I New genes (734). Higher recombination rates 
[46] and rates of molecular evolution in high GC con- 
tent regions of the A. mellifera genome [47] may have 
contributed to sequence divergence that made Type II 
New genes difficult to detect when generating 
OGSvl.O. High GC content regions have been shown 
to be enriched in genes associated with behavioral 
traits [47], suggesting that some Type II New genes 
may be associated with important bee-specific traits. 

Despite the higher mean GC content of regions contain- 
ing Type II New genes, the strong bias for A. mellifera 
genes to reside in low GC content regions relative to the 
genome [1] remains. Among the wide range of insect ge- 
nomes we have examined thus far, only the hymenop- 
terans A. mellifera and H. saltator show a strong bias for 
genes to occur in compositional domains with low GC 
content, although S. invicta, P. barbatus, L. humile and 
N. vitripennis show a slight bias [48]. The biological mean- 
ing of this, and whether this is related to the lifestyles of 
these hymenopterans, is still unclear. 

Comparisons of the old and new gene sets suggested 
that short and single coding exon genes, with spatially- 
or temporally-restricted expression patterns and low 
protein homology remain difficult genes to predict. Com- 
pared with Previously Known genes. Type II New genes 
were more rarely and narrowly expressed, had shorter 
coding sequences, were more likely to be single coding 
exon genes, and were less likely to have detectable homo- 
logs. The characteristics of Type II New genes were 
consistent with those of new genes identified in the mod- 
Encode effort to reannotate the developmental transcrip- 
tome of Drosophila melanogaster [49]. Their "new 
transcribed regions" (NTRs) in Drosophila also had low 
expression levels with temporally restricted patterns. In 
addition more than half of these NTRs were single-exon 
genes, and the multi-exon NTRs were shorter and less 
conserved than previously annotated genes [49]. While 
difficulty in computational identification of functional 
small open-reading frames (smORFs; <100 codons) has 
led to their low representation in genome annotations, 
increasing evidence supports functional smORFs in eu- 
karyotes [50,51]. 



Efficient and effective annotation methods for non- 
model organisms has become critical in a time when ini- 
tiatives such as i5K (5000 insect genomes [52]) and 
GIOK (10,000 animal genomes [53]) motivate scientists 
to investigate genomes of diverse organisms, many of 
which will be evolutionarily distant from existing model 
organisms. Our results suggest that investigators wishing 
to comprehensively annotate protein-coding genes in ge- 
nomes of non-model organisms should invest in tran- 
scriptome sequencing that is both deep and broad, 
similar to findings in re-annotation of the green anole 
lizard (Anolis carolinensis) genome [54]. Type II New 
genes were more likely to be narrowly expressed, so 
transcript evidence from only a single tissue may have 
missed a high proportion of the expressed genes. Tran- 
scriptome sequence from multiple tissues, life stages and 
conditions may be more useful than protein homolog 
evidence; the number of Type II New genes with tran- 
script evidence exceeded the number of those with pro- 
tein homolog evidence, despite the relatively close 
evolutionary distance between A. mellifera and some of 
the reference species, which included six species within 
the order Hymenoptera. Even with sampling many tran- 
scriptomes, rarely expressed genes can be missed, and 
genomic sequence from closely related species can aid 
gene prediction by leveraging nucleotide sequence con- 
servation; our analysis showed that 752 of the OGSv3.2 
genes were supported by genome conservation as their 
only source of biological evidence. As more genomes are 
sequenced and annotated, gene prediction should be- 
come easier. When expression and homolog evidence 
are lacking, it is important to train ab initio prediction 
algorithms using genes representing the true distribution 
of gene features such as coding sequence and exon 
length, codon usage and GC content. Our analysis of 
Type I and Type II New genes suggests that the bimodal 
distribution of genome GC content in A. mellifera re- 
sults in at least two classes of genes distinguished by dif- 
ferent distributions of GC content and codon usage. 
Ab initio gene predictors may benefit by training with 
coding sequences from each class separately. 

Conclusions 

We have shown that next generation sequencing followed 
by de novo annotation can substantially improve an unfin- 
ished first generation genome sequence assembly and an- 
notation. The upgraded assembly and annotation allowed 
us to confirm that the honey bee genome contains a 
typical number of genes relative to other insects. The 
improved honey bee gene set will be invaluable to the 
honey bee research community in efforts to elucidate the 
mechanisms behind fundamental biological processes, 
such as evolution of insect eusociality, as well as agricul- 
tural issues, such as pollinator health and immunity. 



Elsik ef al. BMC Genomics 2014, 15:86 
http://www.biomedcentral.com/1471-2164/15/86 



Page 18 of 29 



Furthermore, understanding the reasons genes were not 
predicted in OGSvl.O will lead to more effective gene pre- 
diction strategies for new genome projects. 

Methods 

Genome sequencing and assembly 

We improved the published genome assembly of 2.7 
million Sanger reads of the honey bee genome, version 
Amel_4.0 [1] by incorporating ABI Solid sequence and 
Roche 454 paired-end sequence to superscaffold and 
gap-fill the Amel_4.0 assembly using the Atlas-Link 
software [55]. The new sequence data as well as the 
existing sequence data were used to link the genome 
contigs from Amel_4.0 into more contiguous scaffolds. 
Adjacent contig sequences within these scaffolds were 
assessed and overlapping and redundant contigs were 
merged. 

We used the paired-end reads for superscaffolding 
and intra-scaffold gap filling. The Amel_4.5 assembly 
contains new scaffolds formed from merging existing 
scaffolds and filling some intra-scaffold gaps with other 
scaffolds or contigs. 

The newly formed scaffolds were anchored to the same 
linkage group that their member contigs were anchored to 
in Amel_4.0. Ten of the new scaffolds could be anchored 
to two different linkage groups so a manual break was 
inserted to split these scaffolds for consistent anchoring. 

New data for gene prediction 
RNAseq data 

We sequenced samples from a number of tissues using 
the 454 Titanium technology. Tissues from ovary, testes, 
mixed antennae (worker, drones, and queens), larvae, 
mixed embryos, abdomen, and a combined library from 
brain and ovary samples were sequenced. 

The testes cDNA library was prepared with the Clontech 
"full-length" amplification protocol. A gel cut of 400 to 
800 bp fragments was combined with nebulized products 
from the larger cDNA fragments to generate fragments of 
an optimum sized sequencing library for the 454 Titanium 
platform. Other libraries were prepared by isolating total 
RNA using Trizol and RNeasy columns followed by mRNA 
isolation using the Qiagen kit and cDNA genration using 
the Invitrogen Superscript kit #11917-010 with random 
primers. A gel cut of 400 to 800 bp fragments was com- 
bined with nebulized products from the larger cDNA 
fragments to generate fragments of an optimum sized 
sequencing library for the 454 Titanium platform. 

Peptide data 

Honey Bee peptide atlas We performed liquid chro- 
matography-tandem mass spectrometric (LC-MS/MS) 
analysis of bee protein extracts from 253 samples, 
representing three castes, larvae and virtually all adult 



honey bee tissues in both sexes and of different disease 
states. Tissue collection and interpretation of the LC- 
MS/MS data has largely been described elsewhere [56-61] 
and is available for public download from the Honey Bee 
Peptide Adas [62]. The raw data from these studies, 
amounting to more than 8 x 10^ tandem mass spectra, 
were searched against a six-frame translation (3,596,047 
forward sequences, with reversed complemented se- 
quences and common contaminants concatenated) of 
Amel_4-5 using Mascot (v2.3. Matrix Science). This exer- 
cise identified 30,622 unique peptide sequences at a false 
discovery rate of 1%, of which 5,834 peptides mapped to 
regions of the genome, where genes were previously un- 
known in Amel_4.0. As these results were based solely on 
a basic translation of raw genomic sequence, these pep- 
tides represent only tryptic peptides wholly within a single 
exon. 



Venom peptides We analyzed the honey bee worker 
venom proteome by integrating a combinatorial peptide 
ligand library (CPLL) with liquid chromatography/Fourier 
transform ion cyclotron resonance (LC-FTICR) MS/MS. 
The collection and interpretation of the MS/MS data are 
described elsewhere [35]. Gene prediction datasets and a 
sbc-frame translation of Amel_4.5 were searched using 
Mascot (v2.3. Matrix Science). Setting the significance 
threshold at p < 0.01 led to a peptide false discovery rate 
of 5.23% for the search of the AUGUSTUS AU9 gene set, 
2.51% for the AUGUSTUS AUll search and 3.77% for 
the Amel_4.5 NCBI RefSeq search. MS/MS data generated 
from the CPLL flow-through fractions, and the elution 
fractions separated by Tris-glycine- or Tris-tricine-SDS- 
PAGE gels were separately searched against the genome 
sbc-frame translation resulting in false discovery rates of 
respectively 1.56%, 3.75% and 3.44%. 

Annotation methods 
Summary of gene prediction sets 

Genes were predicted using a variety of methods including 
the NCBI RefSeq and Gnomon pipelines, AUGUSTUS, 
SGP2, GenelD, Fgenesh++ and N-SCAN. Unless stated 
otherwise below, gene prediction pipelines used the masked 
assembly available from NCBI [63], which was generated 
using RepeatMasker [64]. New transcriptome data was used 
either directiy as evidence or in the generation of training 
sets for the predictors. The Augustus analysis also incorpo- 
rated available peptide data, and the N-SCAN analysis lev- 
eraged nucleotide sequence conservation between the A. 
mellifem genome and the other bee genomes. SGP2 lever- 
aged conservation between the A. mellifem genome and 
the previously published Nasonia genomes [23] based on 
translated alignments. 
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NCBI RefSeq and Gnomon 

The Amel_4-5 assembly was annotated with NCBI's 
eukaryotic genome annotation pipeline (version 3.0), as de- 
scribed at [65]. The NCBI pipeline uses a repeat- masked as- 
sembly generated by WindowsMasker [66] and transcript 
and protein alignment evidence supplemented with ab 
initio prediction to annotate coding and non-coding tran- 
scripts and proteins. The evidence used for this annotation 
run included alignments of: 

1) 9 million RNAseq reads from 454 sequencing 
(described above), which were treated as ESTs in the 
pipeline. 

2) All available honey bee ESTs and mRNAs. 

3) Proteins from the FlyBase annotation of Drosophila 
melanogaster (release 5.30). 

4) Proteins from the RefSeq annotation of human. 

5) Proteins annotated on insect mRNAs. 

6) Proteins from the annotations of the ant genomes, 
Harpegnathos saltator and Camponotus floridanus 
available in GenBank. 

The final RefSeq annotation included models that were 
completely or partially supported by alignment evidence. 
The pipeline had some provisions to predict models that 
were disrupted by frameshifts or stop codons in the assem- 
bly, some of which were assigned protein accessions (XP_ 
prefixes) and named with the prefix 'LOW QUALITY 
PROTEIN,' whereas others were conservatively classified as 
pseudogenes and not assigned protein accessions. The 
RefSeq annotation is available from NCBI's genomes FTP 
site as NCBI build 5.1 [63]. 



AUGUSTUS 

AUGUSTUS can be used as an ab initio gene prediction 
tool, but can also integrate extrinsic evidence from vari- 
ous sources [67]. 

Generation of training gene structures and training 
AUGUSTUS 

We used Scipio [68] to generate gene structures on the 
Amel_4.5 assembly with a protein set from a previous A. 
mellifera annotation (Amel_pre_release2_OGS_pep.fa). 
We used these gene structures to optimize AUGUSTUS 
parameters for A. mellifera, and constructed UTR models 
from 78,274 A. mellifera ESTs from GenBank, which were 
used to train AUGUSTUS UTR parameters. We then pre- 
dicted genes in the Amel_4.5 assembly. Individual RNA- 
seq reads from 454 sequencing (described above) were 
mapped against predicted transcripts, and fully covered 
transcripts (2,151) were selected as training genes for opti- 
mizing a final AUGUSTUS parameter set. 



Gene prediction with AUGUSTUS 

We generated three gene sets using AUGUSTUS: a gene 
set with extrinsic evidence from ESTs and RNAseq data 
(AU9), a particularly inclusive gene set that contained 
many alternative transcripts for peptide identification 
(AUll), and a gene set with extrinsic evidence from 
ESTs, RNAseq data and peptide data (AU12) (Table S6 
in Additional file 1). 

We created extrinsic evidence "hints" for protein cod- 
ing genes and transcripts from the A. mellifera ESTs and 
from 454 transcriptome libraries. Genes were predicted 
with AUGUSTUS, allowing the prediction of alternative 
transcripts and allowing the splice site AT-AC (in 
addition to GT-AG and GC-AG) in case of supporting ex- 
trinsic evidence. The resulting gene set was named AU9. 
AU9 genes were predicted using the following options: 
2augustus -species = honeybeel -UTR = on -print_utr = 
on -hintsfUe = all_but_no_peptide.hints -extrinsicCfgFUe = 
extrinsic.M.RM.E.W.cfg -exonnames = on -codingseq = on - 
altematives-from-evidence = true -allow_hinted_splicesites = 
atac genomefe 

For the purpose of peptide identification, alternative 
transcripts were extensively sampled with AUGUSTUS, 
resulting in the gene set AUll. AUll genes were 
predicted issuing the following command: augustus - 
UTR = on -print_utr = on -hintsfUe = all_but_no_peptide. 
hints -extrinsicCfgFile = extrinsic.M.RM.E.W.cfg -exon- 
names = on -codingseq = on -species = honeybeel -alter- 
natives-from-evidence = true -alternatives-from-sampling = 
true -sample = 100 -minexonintronprob = 0.1 -minmeanex- 
onintronprob = 0.4 -maxtracks = -1 -allow_hinted_splice- 
sites = atac genome.fa 

We generated hints from peptide sequences by map- 
ping the peptides against the protein set of AUll and 
against a six-frame translation of the genome using 
BEAT [69] in a non-redundant way (i.e. parts of the six- 
frame translation that were included in the protein set 
AUll and redundant parts of the AUll protein set that 
were removed). A final AUGUSTUS gene set AU12 was 
generated using all available extrinsic evidence and the fol- 
lowing options: augustus -species = honeybeel -UTR = 
on -print_utr = on -hintsfile = all.hints -extrinsicCfgFOe = 
extrinsic.M.RM.E.W.cfg -exonnames = on -codingseq = on - 
alternatives-firom-evidence = true -allow_hinted_splicesites = 
atac genome.fa. 

Fgenesh++ 

Predictions were made using FGENESH 3.1.1 [70,71] 
using the HBEE matrix with parameters specific for 
A. mellifera. We used Illumina transcriptome data 
from A. mellifera forager and nurse brains available 
from the SRA (SRP003528), which were a total of 
181.8 M spots, 18. 3G bases of 100 bp single end reads 
generated on an Illumina Genome Analyzer II in 2010. 
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We mapped the Illumina RNASeq using the ReadsMap 
program [72], which provided intron position informa- 
tion to Fgenesh for building sample-specific gene 
models. Predictions were made based on individuals 
used in Illumina RNASeq libraries (five nurses and five 
foragers), and then were combined into forager and 
nurse group predictions, and redundant genes (those 
having coinciding coding sequences) were removed 
from each set. Finally, the forager and nurse predic- 
tions were combined into a final set of predictions, and 
redundant genes (those having coinciding coding se- 
quences) were removed from this combined set to pro- 
duce the final Fgenesh+-i- prediction set. 

GenelD 

GenelD is an ab initio gene prediction program used to 
find potential protein-coding genes in anonymous gen- 
omic sequences. The training of GenelD to obtain a par- 
ameter file for A. mellifem was based on the method 
described to obtain a Drosophila melanogaster GenelD 
parameter file [73]. Training was performed in a "semi- 
automated" manner by employing a recently developed 
GenelD training tool that computes position weight 
matrices (PWMs) or Markov models of order 1 for 
splice sites and start codons, and derives a model of cod- 
ing DNA, which, in this case, is a Markov model of order 
5. Furthermore, once a preliminary species-specific matrix 
is obtained it is further optimized by adjusting two in- 
ternal matrix parameters: -the cutoff of the scores of the 
predicted exons (eWF) and the ratio of signal to coding 
statistics information to be used (oWF). 

The initial A. mellifem training set was comprised of the 
2,151 gene models used to train AUGUSTUS (described 
above). Of these gene models 80% (1,720) were used to 
train GenelD while the remaining 20% (431) were set aside 
to test the accuracy of the newly developed matrix. The 
1,720 A. mellifera protein-coding gene models included 
7,913 canonical donor splice sites/7,939 canonical acceptor 
sites and 1,720 start codons. The start codons were used to 
compute PWMs while the donor and acceptors were used 
to derive Markov matrices of order 1. Given the large num- 
ber of sequences we also had enough coding (2,338,800) 
and non-coding (5,561,154) nucleotides to derive a Markov 
of order 5 for the coding potential. We tested accuracy of 
the GenelD A. mellifera parameter file on an artificial con- 
tig consisting of the 431 evaluation-set concatenated gene 
models with 800 nucleotides of intervening sequence be- 
tween each of the genes (Table S7 in Additional file 1). We 
then used the GenelD parameter files to predict genes on 
an assembly consisting of Amel_4-5 sixteen chromosomes 
and 5,304 "unplaced" scaffolds files that had repeat se- 
quences masked using Repeatmasker. GenelD predicted 
24,554 protein-coding genes. 



SGP2 

SGP2 is a syntenic gene prediction tool that combines ab 
initio gene prediction (GenelD) with TBLASTX searches 
between two or more genome sequences to provide both 
sensitive and specific gene predictions, and it tends to im- 
prove GenelD's performance, especially by reducing the 
number of false-positive predictions. SGP2 requires one 
or more reference genomes to which the target genome 
(in this case A. mellifera) is compared. We decided to use 
the genomes of Nasonia vitripennis, N. giraulti and N. 
longicornis [23] as references to develop our A. mellifera 
parameter file for SGP2 because the genus Nasonia is at 
an appropriate evolutionary distance from Apis such that 
mostly the coding regions of the genes, not the introns or 
intergenic regions, are significantly conserved between 
these two genomes. 

Obtaining the SGP2 A. mellifera-specific parameter 
file was based on the methodology described by Parra 
et al [74] used to obtain a human SGP2 parameter 
file using mouse homology evidence. The starting 
point to obtain a parameter file for SGP2 was the 
previously described GenelD A. mellifera matrix. The 
GenelD-derived SGP2 parameter file was optimized 
on an artificial contig comprising the same concatenated 
1,720 sequences used to train GenelD, with 800 nucleo- 
tides between each of the gene models. We optimized the 
SGP2 matrix by modifying not only the eWF internal par- 
ameter (as previously for the GenelD parameter file) but 
also two SGP2-specific internal parameters ("NO_SCORE" 
and "HSP_factor"). The "NO_SCORE" parameter provides 
a penalty for no overlap between TBLASTX-derived HSPs 
(High-Scoring Pairs) and GenelD ab initio predictions in 
the same region whereas the "HSP_factor" parameter re- 
duces the score assigned to the HSPs in order to maximize 
the prediction accuracy. We evaluated the newly devel- 
oped SGP2 parameter file on the same artificial contig 
consisting of the 431 concatenated gene models with 800 
nucleotides of intervening sequence between each of the 
genes used to evaluate the GenelD bee matrix (Table 1). 
We then used the SGP2 parameter file2 to predict genes 
on the same assembly file used for GenelD, and generated 
20,179 predictions. 

N-SCAN 

We used the N-SCAN package [75] to leverage conser- 
vation between the A. mellifera genome and genomes of 
two other bee species, A. florea (Aflo_1.0 ) and Bombus 
terrestris (Bter_1.0). We first masked Amel_4.5 for sim- 
ple sequence repeats using RepeatMasker [64]. We ran 
LASTZ [76] using default parameters with Amel_4.5 as 
the target genome, and either Apis florea (Aflo_1.0) or 
Bombus terrestris (Bter l.O) as the informant genome. 
We then used iParameterEstimation to generate both an 
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Amel_4-5-Aflo_1.0 specific parameter set as well as an 
Amel_4.5-Bter_1.0 parameter set using the training set 
described above for AUGUSTUS gene prediction, in- 
cluding UTR features. Finally, we ran N-SCAN using 
each of the A. mellilfera specific parameter sets with the 
respective LASTZ informant alignments to produce two 
N-SCAN prediction sets, one set based on Aflo_1.0 as 
the informant genome and the other set based on 
Bter l.O as the informant genome. 

Combining gene sets 

Input data for combined gene sets 

We used MAKER2 and GLEAN to generate combined 
gene sets. The MAKER2 and GLEAN analyses used the 
same set of input data. Both analyses combined the 
gene predictions described above (NCBI, AUGUSTUS, 
Fgenesh++ with RNAseq, N-SCAN using A. florea as an 
informant genome, GenelD and SGP2) with transcript and 
protein homolog alignments. Transcript data included the 
new 454 transcriptome data described above, A. mellifera 
ESTs from GenBank and lUumina nurse and forager reads 
downloaded from the SRA (SRP003528, described above). 

We aligned lUumina reads to Amel_4.5 in two groups, 
nurse and forager, using Tophat version 1.3.1 with the 
option "-butterfly-search" for more sensitive splice junc- 
tion detection, and then generated predicted transcripts 
for each set of pooled data using Cufflinks version 1.0.3 
with default parameters. The 454 reads were assembled 
into contigs de novo using Newbler (2.3-PreRelease-9/ 
14/2009) with the cDNA option. We aligned 454 contigs 
and ESTs to Amel_4.5 using MAKER2 v2.15, which uses 
AXTJ-BLAST [77] and Exonerate est2genome [78], with 
minimum 80% alignment coverage and 95% identity. 
Protein homolog alignments included SwissProt [79] 
Metazoa homologs, Drosophila melanogaster (fruit fly; 
r5.31) [80], Nasonia vitripennis (parasitoid wasp; 
OGSvl.2) [23] and the ants: Atta cephalotes (OGSvl.l) 
[22], Camponotus floridanus (OGSv3.3) [18], Harpeg- 
nathos saltator (OGSv3.3) [18], Linepithema humile 
(OGSvl.l) [20], Pogonomyrmex barbatus (OGSvl.l) 
[21]). Proteins in the SwissProt dataset annotated as 
transposable elements were removed prior to alignment. 
We aligned protein sequences to Amel_4.5 using Exone- 
rate protein2genome with a minimum 60% percent 
identity and 60% alignment coverage. 

MAKER2 

To create a combined gene set, we ran MAKER2 v2.15 
using parameters min_contig = 1000 and pred_gff, which 
allowed us to provide as input the gene prediction sets 
and alignments described above, instead of generating 
new evidence tracks within MAKER2. 



GLEAN 

We ran GLEAN [17] 32 times to create consensus gene 
sets using different combinations of the gene prediction 
sets described above. All of the GLEAN runs included 
the transcript and protein homolog alignments described 
above, and required a minimum coding sequence length 
of 75 nt. 

Selecting the new official gene Set 

We evaluated the 32 GLEAN sets based on several cri- 
teria including overlap with a conservative evidence- 
based set (RefSeq), transcript sequences, peptides and 
the CEGMA conserved core set [30] (Additional file 2). 
NCBI's RefSeq pipeline has been considered reliable and 
relatively conservative, so we performed overlap analyses 
to determine how many of the RefSeq models were cap- 
tured in the GLEAN consensus sets. We used PASTA 
[81] to align GLEAN coding sequences with RefSeq cod- 
ing sequences where a perfect alignment required 99% 
identity over the entire lengths of both sequences. We 
used FASTA to align peptides to GLEAN proteins, with 
E-value and scoring matrix optimized for short exact 
matches (E-value .01 and MDIO scoring matrix). We 
parsed peptide alignments to count those with 100% 
identity and 100% alignment coverage (Additional file 2). 

In addition to alignments between GLEAN predictions 
and RefSeq genes and peptides, we considered numbers 
of gene models, total numbers of coding nucleotides, 
average coding sequence lengths, and numbers of splits 
and merges compared to RefSeq. We selected the 
GLEAN31 set (Additional file 2) to be the official gene 
set, OGSv3.2. It was the set generated using gene predic- 
tions from NCBI Gnomon, AUGUSTUS, Fgenesh-i--i-, 
N-SCAN using A. florea as an informant genome and 
GenelD and SGP2 combined as a single set as well as the 
transcript and protein homolog alignments. 

Adding UTRs to OGSv3.2 gene models 

Although we did not use MAKER2 to generate the offi- 
cial gene set, we did use MAKER2 (v2.15) [28] to add 
UTR to the GLEAN coding exons since GLEAN does 
not produce annotations with UTR features. OGSv3.2 
coding exons were input as "pred_gff ' and transcriptome 
evidence (including the 454 transcripts, Illumina contigs 
and A. mellifera ESTs described above) was input as 
"est". MAKER2 aligned the EST evidence to Amel_4.5 
using WU-BLAST and Exonerate est2genome then used 
overlapping EST evidence to extend OGSv3.2 models to 
include UTR features when possible. Because MAKER2 
sometimes modified the coding exon coordinates, we 
processed the MAKER2 output and retained UTRs only 
when the coding exon structure was unchanged. Out of 
a total of 15,314 OGSv3.2 genes, UTR were added to 
7,514 genes (49%). 
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Identifying and cliaracterizing of new genes 
Identifying of new and previously known OGSv3.2 genes 

In order to compare the newest gene set (OGSv3.2) with 
the first ofiicial gene set (OGSvl.O), we mapped OGSv3.2 
coding sequences to Amel_2.0, which was the assembly on 
which OGSvl.O was generated [1]. First, we used Mega- 
BLAST [82] to identify scaffold/gene matches with 95% 
identity and E-value < 1 x 10'^". We then aligned coding se- 
quences to matching scafiblds using GMAP [83], and 
parsed the output to create two sets of splice-modeled 
alignments, both requiring 95% identity. One set was based 
on a relaxed coverage criterion, requiring that the align- 
ment cover at least 50% of the coding sequence. The other 
set was based on a stringent coverage criterion, requiring 
that the alignment cover 80% of the coding sequence. Re- 
sults of further analyses for the relaxed mapping set are 
provided in the supplemental materials, but we discuss only 
results for the stringent mapping. 

On the basis of mapping OGSv3.2 to Amel_2.0 and 
overlap between OGSv3.2 and OGSvl.O gene models on 
the Amel_2.0 assembly, we divided 15,314 OGSv3.2 
genes into three sub-sets (Table 6). We deemed any 
OGSv3.2 gene that did not align to the Amel_2.0 assem- 
bly a "Type I New" gene. The additional sequencing and 
reassembly of the genome for the Amel_4.5 assembly 
likely allowed the detection of these genes. "Type II 
New" genes were those that did align to the Amel_2.0 
assembly, but whose coordinates did not overlap an 
OGSvl.O gene by a single coding base pair on the same 
strand. Additional expressed sequence and protein 
homolog evidence as well as improvements to gene pre- 
diction algorithms likely contributed to the detection of 
these genes. Finally, any OGSv3.2 gene that both aligned 
to the Amel_2.0 assembly and overlapped an OGSvl.O 
gene was deemed a "Previously Known" gene. 

Coding sequence length analysis 

The total length of the coding sequence was calculated 
for each gene and the means for all genes and each gene 
sub-set were calculated (Table 6). We tested the null hy- 
potheses that the mean coding sequence lengths of Type 
I and Type II New genes and Previously Known genes 
were equal. 

Splice site and single versus multiple coding exon 
gene analysis 

We assessed the genomic sequence of the two intronic base 
pairs adjacent to each coding sequence exon-intron splice 
site to determine whether they corresponded to the canon- 
ical ...]5'-GT/AG-3'[... splice site sequence. We considered 
only splice sites supported by matching intron coordinates 
of spliced transcript alignment evidence. We used chi- 
square tests with one degree of freedom to compare the 
frequencies of single coding exon genes and non-canonical 



splice sites in Type I or Type II New genes with Previously 
Known genes. 

OGSv3.2 gene location relative to GC compositional 
domains 

We used IsoPlotter, a recursive segmentation algorithm 
[84,85], to partition the A. mellifera genome into GC 
compositionally homogeneous domains, contiguous re- 
gions of the genome with similar GC content (percent 
G + C nucleotides). We determined the GC content of 
the GC compositional domain for each OGSv3.2 gene. 
In cases where a gene spanned multiple GC compos- 
itional domains, we calculated the average GC content 
of the GC compositional domains, and weighted it ac- 
cording to the fraction of the gene's length that occurs 
in each GC compositional domain. We computed the 
weighted percent GC based only on non-coding nucleo- 
tides of the compositional domains, because we did not 
wish to include effects of codon bias. 

We compared the distribution of Type I and Type II 
New genes with respect to GC content to that of Previ- 
ously Known genes. IsoPlotter cannot segment scaffolds 
less than 10 kb into compositional domains, so the 90 
genes residing in these short scaffolds were not consid- 
ered. We tested the null hypotheses that the means of 
the weighted GC compositional domain contents for 
genes in the Type I and Type II New gene sets were 
equal to the mean of the Previously Known genes. 

Effective number of codons analysis 

Using the chips program within the EMBOSS package 
[86], we calculated the effective number of codons sepa- 
rately for each OGSv3.2 gene. We tested the null hypo- 
theses that the mean effective number of codons values 
for the Type I and Type II New genes were equal to the 
mean of the Previously Known genes. 

Tissue expression analysis 

We determined the number of OGSv3.2 gene overlaps 
to transcript alignments that were used in creating the 
GLEAN consensus gene sets (454 reads, Illumina con- 
tigs and A. mellifera ESTs from GenBank, described 
above). We relied on splice signals to determine the di- 
rectionality of a transcript read. We tallied spliced and 
unspliced alignments separately, since we could be 
confident when directionality of a spliced alignment 
agreed with a gene prediction, but could not be confident 
about unspliced alignments. For spliced transcript align- 
ments, if the transcript was on the opposite strand from 
the gene then it was discarded from further analysis. For 
transcripts on the same strand or transcripts that were 
un-spliced, in which case directionality could not be de- 
termined, a coordinate overlap of at least one coding base 
pair was required for a gene to count as overlapping with 
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a transcript alignment. We counted the number of tran- 
script data sets in which each OGSv3.2 gene was found 
to have an overlapping alignment (Table 6.) We per- 
formed chi-square tests with one degree of freedom to 
compare the frequencies of spliced transcript overlap in 
the Type I or Type II New gene sets with the Previously 
Known gene set. 

Of genes that overlapped spliced transcript align- 
ments, we identified genes that were narrowly expressed 
and genes that were broadly expressed on the basis of 
overlap to the four single-tissue libraries (brain [com- 
bined Illumina forager and nurse brain libraries] and 
454 libraries of mixed antennae, ovary and testes). (For- 
agers and nurses are worker honey bees that specialize 
on collecting food and feeding brood, respectively.) 
Genes were deemed narrowly expressed if they over- 
lapped at least one transcript alignment in only one of 
the four tissues and broadly expressed if they over- 
lapped at least one transcript alignment from all four 
tissues. We performed chi-square tests with one degree 
of freedom to compare the frequencies of narrowly 
expressed genes and broadly expressed genes in the 
Type I or Type II New gene sets with the Previously 
Known gene set. 

Homolog analysis 

We determined the number of protein alignments over- 
lapping OGSv3.2 for protein data sets that were used in 
creating the GLEAN consensus gene sets. We required 
overlap of at least one coding base pair on the same 
strand to deem a gene overlapping with a protein homo- 
log alignment. We performed chi-square tests with one 
degree of freedom to compare the frequency of the exis- 
tence of overlaps to homolog alignments for Type I or 
Type II New genes with that of Previously Known genes. 

TBLASTN alignment of OGSv3.2 to A. florea and B. terrestris 
genomes 

We aligned OGSv3.2 protein sequences to the A. florea 
(Aflo_1.0) and B. terrestris (Bter l.O) genome assemblies 
using TBLASTN [87] with an E-value criteria of 1 x 10'°'^. 
We did not use information about predicted genes or ex- 
pression in A. florea or B. terrestris. 

Statistical methods to test for differences in means between 
new and previously known genes 

To test the null hypothesis that the mean of a particular 
variable for Type I or Type II New genes was equal to 
the mean of the Previously Known genes, we used both a 
non-parametric Kolmogorov-Smirnov test and a Welch 
t-test with the correction for non-homogeneity of vari- 
ances. Testing the hypotheses in this way avoids assum- 
ing these data were normally distributed or had equal 



variances. To be conservative, we report the least signifi- 
cant P-value for each test. 

Using peptide data in development and analysis of gene 
sets 

We used the peptide data is described above to evaluate 
the GLEAN sets and analyze Type I New, Type II New, 
and Previously Known genes. We aligned the peptide se- 
quences to predicted protein sequences with FASTA 
[81] with a relaxed E-value of 0.1 and the MDIO scoring 
matrix. These parameters were found to allow matches 
to peptide sequences as short as 6 amino acids with 
100% identity. Only alignments with 100% identity were 
retained. We used chi-square tests with one degree of 
freedom to compare the frequency of proteins that 
aligned to peptide sequences in Type I or Type II New 
genes with that of Previously Known genes. 

The venom peptide data were used separately to evalu- 
ate gene sets for improved identification of venom 
genes. All significant and top ranking venom peptides 
from the Mascot output, with an ion score >30 were 
retained in the final peptide lists. Venom mass spectra 
were searched against the OGSv3.2 and OGSvl.O to 
compare the contribution of both datasets to the identi- 
fication of new venom genes. The false discovery rates of 
both searches were set to 1%. 

Venom peptide data were used to annotate venom 
genes using the Apollo annotation tool [36], provided 
for Amel_4.5 by the Hymenoptera Genome Database 
[88]. 

Assessing orthology 

The protein-coding gene annotations from the two A. mel- 
Ufera genome assemblies were compared with orthologs 
from OrthoDB [89] from nine other insects. These included 
Pediculus humanus (PhumUl.2, 10,772 genes), Acyrthosi- 
phon pisum (ACYPI v2.1, 36,275), Nasonia vitripennis 
(nvit2, 24,369), Linepithema humile (OGSvl.2, 16,048), 
Pogonomyrmex barbatus (OGSvl.2, 17,100), Tribolium cas- 
taneum (Teas 3.0, 16,565), Danaus plexippus (OGS2, 
15,329), Anopheles gambiae (AgamP3.6, 12,670), and Dros- 
ophila melanagaster (r5.45, 13,927). For annotations on the 
older A. melUfera genome assembly, we used the gene set 
known to the community as "Amel_prerelease2" (abbrevi- 
ated as V2), available from BeeBase. It was the gene set 
resulting from mapping OGSvl.O from assembly Amel_2.0 
to assembly Amel_4-0, and included a small number of 
manual annotations, with a total of 10,699 genes. Compar- 
ing each Apis annotation (V2 and OGSv3.2) to the other 
nine gene sets identified near-universal orthologous groups 
with orthologs in all but one or all but two insects. For each 
gene set. Figure 4 and Table S5 in Additional file 1 show 
counts of near-universal insect orthologous groups that are 
missing orthologs in different species. Total counts were 
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partitioned into groups with only single-copy orthologs and 
those with gene duplications, further divided into those 
with only one missing species and those with two missing 
species. 

Predicting protein functions 
GO analysis 

We used FASTA [81] with an E-value threshold of 1 x 
10' to compute reciprocal alignments between OGSv3.2 
proteins and a D. melanogaster protein set consisting of 
the longest protein isoform of each gene (annotation 
version r5.42). We identified reciprocal best hits (RBH) 
and transferred Gene Ontology (GO) [90] annotations 
from the D. melanogaster protein to the A. mellifera 
protein for each RBH pair, using the GO annotation file 
available at FlyBase [80]. We used GeneMerge [91] to 
test for enrichment of GO terms in OGSv3.2 protein 
datasets, testing for each of the three GO ontologies 
(Molecular Function, Biological Process and Cellular 
Component) separately. Several tests were performed 
with either the entire set of Gene Ontology terms, the 
generic GO slim set, or the GO slim set developed for 
A. mellifera by Whitfield et al. [92]. Population and test 
gene datasets for a particular GO ontology included only 
OGSv3.2 genes with GO annotations for that ontology. 
Population datasets consisted of all OGSv3.2 genes with 
GO annotations. Test datasets were 1) all new genes 
based on stringent mapping criteria, 2) Type I New 
genes using stringent criteria, 3) Type II New genes 
based on stringent mapping criteria. 

InterPro analysis 

We used InterProScan [93] to compare OGSv3.2 and 
OGSvl.O proteins with the following InterPro [29] protein 
domain and motif databases: PFAM [94], TIGRFAMS [95], 
SMART [96], PRODOM [97], PROSITE [98], PIRSF [99], 
GENE3D [100], SUPERFAMILY [101], and PANTHER 
[102]. The total numbers of proteins annotated with at least 
one InterPro domain were 9,479 and 8,552 for OGSv3.2 
and OGSvl.O, respectively. For each InterPro domain iden- 
tified in the combined datasets, we determined the number 
of proteins containing that domain within OGSv3.2 and 
OGSvl.O. Then for each InterPro domain, we used 2x2 
chi-square tests with Yates correction and one degree of 
freedom to determine whether the frequencies of proteins 
containing that domain differed between the OGSv3.2 and 
OGSvl.O InterPro-annotated sets (total 9,479 and 8,552 
proteins, respectively). 

Detecting genome-wide repetitive elements 

We detected and annotated repetitive elements with the 
REPET software package ([103], version 2.0) consisting 
of two pipelines integrating a set of bioinformatics 



programs. First, repeated sequences were detected by 
similarity, using an all-by-all BLAST [104] search via 
BLASTER [105]. LTR-retrotransposons were detected by 
structural search with LTRharvest [106]. The similarity 
matches were clustered with GROUPER [105], RECON 
[107] and FILER [108], the structural matches were clus- 
tered with NCBI BLASTclust [109]. From each cluster a 
consensus sequence was generated by multiple align- 
ment with Map. We analyzed the consensus sequences 
for terminal repeats (TRsearch), tandem repeats (TRF), 
open reading frames (dbORF.py, REPET) and poly-A 
tails (polyAtail, REPET). We screened the consensuses 
for matches to nucleotide and amino acid sequences 
from known transposable elements (RepBase 17.01, [42]) 
using BLASTER [105], which runs tblastx and blastx 
[87], and searched them for HMM profiles (Pfam data- 
base 26.0, [94]) using HMMER3 [110]. Based on the de- 
tected structural features and homologies, we classified 
the consensuses using PASTEC according to Wicker 
et al. [111]. We then removed redundancies identified 
with BLASTER and MATCHER [105] as well as ele- 
ments classified as simple sequence repeats (SSRs; >0.75 
SSR coverage) or unclassified elements built from less 
than 10 fragments. 

We used the set of de novo detected repetitive ele- 
ments to mine the genome in a second pipeline with 
BLASTER (using NCBI BLAST, sensitivity 4, followed 
by MATCHER), RepeatMasker (using CrossMatch, sen- 
sitivity q, cutoff at 200) and CENSOR (using NCBI 
BLAST). We removed false positive matches using an 
empirical statistical filter. Satellite repeats were detected 
with TRF [112], MREPS [113]and RepeatMasker [64] 
and were then merged into a single set. We also 
screened the genomic sequences for matching nucleotide 
and amino acid sequences from known transposable ele- 
ments (RepBase 17.01, [42]) with BLASTER. Finally we 
removed TE doublons (loci annotated as multiple trans- 
posable elements) and SSR annotations within trans- 
posable element annotations, and performed a "long join 
procedure" to connect distant fragments. Sequences 
from the de novo repetitive element library, which were 
found to have at least one perfect match in the genome 
were then used to rerun the whole analysis. 

To ensure compatibility and to avoid introducing a bias, 
we refrained fi"om a manual curation or clustering of the 
de novo detected elements before mining the genome. 
However, post hoc we manually analyzed all elements, 
which were previously classified into class I retrotransposon 
or class II DNA transposon elements or unclassified ele- 
ments with detected coding element features (similarity to 
known transposable elements) due to potential chimeric in- 
sertion. At this stage we excluded derivative elements 
(LARD, TRIM, MITE) from fiirther detailed inspection un- 
less carrying a class I or class II element. Some elements 
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were classified "potential Hostgene" in the computational 
analysis, based on characteristics of the DNA, not based on 
overlap analysis with predicted genes. These "potential 
Hostgene" elements, as well as unclassified elements 
("noCat"), were also excluded from manual analysis. We 
performed manual inspection by checking for open-reading 
frames (ORF) with the NCBI ORF Finder (NCBI), by 
searching the NCBI Conserved Domain Database (CDD) 
[114], by searching the most up to date online RepBase 
database (accessed December 2012-February 2013) via 
CENSOR [115]. We also performed phylogenetic analysis 
for LINE RT domains with RTclassl [116] in order to 
achieve a detailed classification for each element, determine 
its potential relation to a family of known elements, to 
evaluate the completeness and to detect potential active ele- 
ments. We defined an element to be complete, if it pos- 
sessed the relevant coding parts with the element-typical 
domains and the structural features (LTR, TIR). The poten- 
tial activity was defined according to the region an intact 
ORF, if present, covered. If an intact ORF seemed to cover 
a complete region including the typical domains (e.g. GAG 
as well as POL, Tase) then the element was considered to 
be potentially active. If a Tase domain was covered by a 
truncated ORF or the Tase itself appeared to be truncated 
but was covered by an intact ORF, or if the RT domain was 
covered by an active ORF but not the remaining element- 
typical domains, then the element was considered to be 
maybe potentially active. During the manual classification 
to at least superfamily level, novel transposable element 
types not covered by the system of Wicker et al. [Ill] were 
also considered: Kolobok, Sola, Chapaev, Ginger, Academ, 
Novosib and ISL2EU class II DNA transposons [117,118]. 

Simple sequence repeats and other low complexity re- 
gions were extracted from the REPET pipeline database 
and processed with a custom Perl script to calculate the 
total coverage of these types of repetitive DNA by omit- 
ting overlaps with transposable element or other repe- 
titive element annotations. 

Ethical approval 

Experimental research followed appropriate guidelines for 
the ethical treatment of research subjects. Human subject 
and animal protocol approvals were not applicable. 

Data availability 

The 454 transcript read data are available in the Sequence 
Read Archive (SRA) [119] at the NCBI (SRP003261, 
SRP003260, SRP001899). The assembled 454 transcripts 
(119,959) are in the Transcriptome Shotgun Assembly 
(TSA) database, and available through NCBI BioProject 
pages for PRJNA51481 and PRJNA51483 [120]. 

The accessions for the assembled 454 transcript data 
are: HP542035 to HP552088 (testes), HP527956 to 
HP542034 (mixed antennae), HP509343 to HP527955 



(embryo), HP482918 to HP509342 (brain; ovary), HP473 
811 to HP482917 (larvae), HP459439 to HP473810 (abdomen), 
and HP552089 to HP579397 (ovary). 

The SOLID genomic read data are available in the 
SRA (SRX097020). The 454 genomic read data are also 
available (SRX006752, SRX000071). 

The Amel_4.5 assembly is available from NCBI under 
the accession GCA_000002195.1. 

The following resources are available at BeeBase [121], 
a division of the Hymenoptera Genome Database [88]: 
genome browsers with gene annotations and supporting 
evidence alignments; BLAST databases with the Amel_4.5 
scaffold assembly, OGSv3.2, all input gene prediction sets 
and transcript contig assemblies; and a data download 
page with fasta sequence files and gff for annotations. 
Peptides and their tandem mass spectra are available from 
the Apis mellifem PeptideAtlas [62]. 

Additional files 



Additional file 1: Is a document containing Tables S1 through S7 
and Figure SI. Table SI. provides details for genomic sequencing runs, 
listed by SRA run number. Table S2. provides a comparison of New and 
Previously Known OGSv3.2 genes based on relaxed mapping criteria. 
Table S3, provides a comparison of frequencies of canonical and 
non-canonical splice sites in New and Previously Known genes. Table S4. 
provides the number of genes overlapping expressed sequence 
alignments for different transcript libraries. Table S5, provides counts of 
near-universal insect orthologous groups that are missing orthologs in 
each species. Table S6. provides evidence and sampling options used 
for the three AUGUSTUS gene sets AU9, AUll, and AU12. Table S7. 
provides gene prediction accuracy of GenelD and SGP2 on an A mellifera 
artificial contig. Figure SI. shows the proportions of different 
transposable element groups in the A mellifera genome. 

Additional file 2: Is a spreadsheet showing results of evaluation of 
32 consensus gene sets generated with GLEAN. 

Additional file 3: Is a spreadsheet showing sources of biological 
evidence and other characteristics for each OGSv3.2 gene, and is 
the data used in comparing Type I New, Type II New and Previously 
Known genes. 

Additional file 4: Is a spreadsheet showing new venom proteins 
and known venom proteins with newly identified tryptic peptides 
in OGSv3.2. 

Additional file 5: Is a spreadsheet providing results of manually 
annotating venom peptides. 

Additional file 6: Is a spreadsheet providing reciprocal best hit 
orthologs between A. mellifera and D. melanogaster and gene 
ontology for D. melanogaster proteins. 

Additional file 7: Is a spreadsheet providing counts of InterPro 
domains in OGSv3.2 and OGSvl.O. 

Additional file 8: Is a spreadsheet providing an analysis of 1234 
OGSv3.2 genes that overlap Interspersed repeats identified by 
REPET. 



Abbreviations 

BAG: Bacterial artificial chromosome; CPLL: Combinatorial peptide ligand 
library; DIRS: Dictyostelium intermediate repeat sequence; ENG: Effective 
number of codons; EST: Expressed sequence tag; ETIGR: Fourier transform 
ion cyclotron resonance; GAG: GAG-Protein of retrotransposons; GO: Gene 
Ontology; HMM: Hidden IVIarkov model; HSP: High scoring pair; LARD: Large 
retrotransposon derivative; LC: Liquid chromatography; LINE: Long 
interspersed element; LTR: Long terminal repeat; MITE: Miniatiure 



Elsik ef al. BMC Genomics 2014, 15:86 
http://www.biomedcentral.com/1471-2164/15/86 



Page 26 of 29 



inverted-repeat transposable element; IVIS/IVIS: Tandem mass spectrometry; 
nt Nucleotides; NTR: New transcribed region; OGS: Official gene set; 
ORF: Open reading frame; POL: POL-polyprotein of retrotransposons; 
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miniature; UTR: Untranslated region. 

Competing interests 

The authors declare that they have no competing interests. 
Authors' contributions 

Manuscript: CGE'' ^, KCW^, AKB'^. Human Genome Sequencing Center 
Director: RAG^. Honey Bee Genome Phase II White Paper Authors: JDE'° IVIB^ 
CGE'' ^, RM^°, HIMR^", GER^^ DBW^° CWW^\ Sequence production: YW^ 
(cDNA sequencing libraries at BCM-HGSC), lEN^ {cDNA sequencing libraries at 
BCM-HGSC), CLK^ IVIH^ VJ^ DMIVl^ HGSC production teams^ GER^^ (cDNA 
sequencing at U. Illinois). cDNA: RM^° (Brain & Ovary, Testes, Mixed Antennae, 
Embryos, Larvae), GER^' (Queen Ovary, Nurse, Eorager), IMEH" (N urse and 
Eorager Brain), RSK" (Nurse and Eorager Brain), JDE'" (Abdomen), JD^ 
(transcript analysis and assembly), DZ^ (transcript analysis). Genome 
Assembly: KCW^ LZ^ Hj"^, JD^ Genetic IVlap: GJH^^ JMT^", JIVI^'', OR^^ ES^\ 
Gene Annotation: CGE^' ^ (N-Scan, MAKER, GLEAN), AKB^ (N-Scan, MAKER), 
TDM^^ (RefSeq, Gnomon), FC^ (SGP2, GenelD), RG^ (SGP2, GenelD), KJH" 
(AUGUSTUS), PK"* (Egenesh-H-), VS^' (Egenesh+-f), MS" (AUGUSTUS). OGS 
Evaluation and Characterization: CGE'' ^, AKB^. GC Composition: CGE'' ^, 
AKB^, EE', DG", JTR^' Orthology Assessment: RMW^"*, EMZ^"*. Protein 
Functional Analysis (GO and InterPro): CGE^' ^. Genomic Repetitive Elements: 
ES^\ REAM^V Peptide data and analysis: MW' (venom peptides), LJF" 
(PeptideAtlas), GD° (venom peptides), BD^ (venom peptides), DCdG' (venom 
peptides). Genome Browser, BLAST Website and Manual Annotation 
Software: CGE'' ^ AKB^ CPC^' MCM-T^' JTR^' All authors read and 
approved the final manuscript. 

Acl<nowledgements 

Funding for the project was provided by a grant to RG from the National 
Human Genome Research Institute, National Institutes of Health (NHGRI, NIH) 
U54 HG003273. Contributions from members of the CGE lab were supported 
by Agriculture and Food Research Initiative Competitive grant no. 2010- 
65205-20407 from the USDA National Institute of Food Agriculture. AKB was 
supported by a Clare Luce Booth Fellowship at Georgetown University. The 
authors are grateful for the HGSC sequence production teams (PatifS., Gub- 
bala.S., Aqrawi,P., Arias,F., Bess,C., Blankenburg,K.B., Brocchini,M., Buhay,C., 
Challis,D., Chang,K, Chen.D., Coleman,P., Drummond,J., English,A., Evani,U., 
Francisco,L., Fu,Q., Goodspeed,R., Haessly.TH., Hale,W., Han,H., Holder,M., Hu,Y., 
Jackson,L., Jakkamsetti,A., Jayaseelan,J.C., Kakkar.N., Kalra.D., Kandadi,H., Lee,S., 
Li,H., Liu,Y., Macmil,S., Mandapat,C.M., Mata,R., Mathew,T., Matskevitch,T., Muni- 
dasa,M., NagaswamyU., Najjar,R., Nguyen,N., Niu,J., Opheim.D., Palculict,T., 
Paul.S., Pellon,M., Perales.L, Pham,C., Pham.P., Pu,L.-L., Oi,S., Qu,J., Ren,Y., Ruth, 
R.T, Saada,N., Sabo,A., San Lucas,F., Sershen.C, Shafer,J., Shah,N., Shelton.R., 
Song,X.-Z., Tabassum,N., Tang,L., Taylor,A., Taylor,M., Velamala,V., Wan,Z., 
Wang,L., Wang,Y., Warren,J., Weissenberger,G., Wilczek-Boney.KB., Yao,J., Yin,B., 
Yu,J., Zhang,J., Zhang, L, Zhou,C., Zhu.D., Zhu,Y., and Zou,X.), and the input of 
other members of the HGSC genome assembly team. 

Author details 

'Division of Animal Sciences, Division of Plant Sciences, and MU Informatics 
Institute, University of Missouri, Columbia, MO 5521 1, USA. ^Department of 
Biology, Georgetown University, Washington, DC 20057, USA. ^Human 
Genome Sequencing Center, Department of Molecular and Human Genetics, 
Baylor College of Medicine, MS BCM226, One Baylor Plaza, Houston, TX 
77030, USA. ''institute of Evolutionary Genetics, Heinrich Heine University 
Duesseldorf, Universitaetsstrasse 1, 40225 Duesseldorf, Germany. ^Center for 
Genomic Regulation, Universitat Pompeu Fabra, C/Dr. Aiguader 88, E-08003 
Barcelona, Catalonia, Spain. ^Division of Animal Sciences, University of 
Missouri, Columbia, MO 6521 1, USA. 'Laboratory of Zoophysiology, Ghent 
University, Krijgslaan 281 S2, B-9000 Ghent, Belgium. ^Laboratory of Protein 
Biochemistry and Biomolecular Engineering, Ghent University, KL 
Ledeganckstraat 35, B-9000 Ghent Belgium. ''Department of Mental Health, 
Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD 



21205-2103, USA '°Bee Research Laboratory, BARC-E USDA-Agricultural 
Research Service, Beltsville, MD 20705, USA. "Department of Biochemistry & 
Molecular Biology, Centre for High-Throughput Biology, University of British 
Columbia, 2125 East Mall, Vancouver, BC, Canada. '^Department of Biology 
and Biochemistry, University of Houston, Houston, TX 77204-5001, USA. 
"Ernst Moritz Arndt University Greifswald, Institute for Mathematics and 
Computer Science, Walther-Rathenau-Str. 47, 17487 Greifswald, Germany. 
'''Department of Crop Sciences and Institute of Genomic Biology, University 
of Illinois at Urbana-Champaign, Urbana, IL 61801, USA. '^Department of 
Entomology, Purdue University, 901 West State Street West Lafayette, IN 
47907-2089, USA. "^Department of Obstetrics, Gynecology & Reproductive 
Sciences, University of Pittsburgh, MAGEE 0000, Pittsburgh, PA 15260, USA. 
"High-Performance Biological Computing (HPCBio), Roy J. Carver 
Biotechnology Center, University of Illinois at Urbana-Champaign, Urbana, IL 
61801, USA. "'Softberry Inc., 116 Radio Circle, Suite 400, Mount Kisco, NY 
10549, USA. "institute for Genomic Biology and Department of 
Bioengineering, University of Illinois at Urbana-Champaign, 1270 DCL, 
MC-278, 1304 W Springfield Ave, Urbana, IL 61801, USA. ^"Research School of 
Biology, The Australian National University, Canberra ACT 0200, Australia, 
^'institut fur Zoologie, Molekulare Okologie, Martin-Luther-Universitat 
Halle-Wittenberg, Hoher Weg 4, D-06099 Halle (Saale), Germany ^^Genomics 
Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA. 
^^National Center for Biotechnology Information, National Library of 
Medicine, National Institutes of Health, Building 45, 8600 Rockville Pike, 
Bethesda, MD 20894, USA. ^""Department of Entomology, University of Illinois 
at Urbana-Champaign, Urbana, IL 61801, USA. ^^Institute for Genomic 
Biology, Department of Entomology, Neuroscience Program, University of 
Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, IL 61801, 
USA. ^^Department of Biology, University of North Carolina at Greensboro, 
321 Mclver Street Greensboro, NC 27412, USA. ^'Computer, Electrical and 
Mathematical Sciences and Engineering Division, King Abdullah University of 
Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi 
Arabia. ^^Extension Field Operations, Clemson University, 120 McGinty Ct, 
Clemson, SC 29634, USA. ^'University of Geneva and Swiss Institute of 
Bioinformatics, CMU, Michel-Servet 1, Geneva CH-121 1, Switzerland. 
^"Genformatic, 6301 Highland Hills Drive, Austin, TX 78731, USA 
^'Department of Entomology, Neuroscience Program, Program in Ecology 
and Evolutionary Biology, University of Illinois at Urbana-Champaign, Urbana, 
IL 61801, USA 

Received: 1 1 September 2013 Accepted: 27 January 2014 
Published: 30 January 2014 

References 

1 . Honey Bee Genome Sequencing Consortium: Insights into social insects from 
the genome of the honeybee Apis mellifera. Nature 2006, 443:931-949. 

2. Zayed A, Robinson GE: Understanding the relationship between brain 
gene expression and social behavior: lessons from the honey bee. 
Anna Rev Genet 2012, 46:591-515. 

3. Begna D, Han B, Feng M, Fang Y, Li J: Differential expressions of nuclear 
proteomes between honeybee (Apis mellifera L.) queen and worker larvae: 

a deep insight into caste pathway decisions. J Proteome fies 201 2, 1 1 :1 31 7-1 329. 

4. Cornman RS, Tarpy DR, Chen Y, Jeffreys L, Lopez D, Pettis JS, vanEngelsdorp 
D, Evans JD: Pathogen webs in collapsing honey bee colonies. PLoS ONE 
2012 7:e43562. 

5. Foret S, Kucharski R, Pellegrini M, Feng S, Jacobsen SE Robinson GE Maleszka R: 
DNA methylation dynamics, metabolic fluxes, gene splicing, and alternative 
phenotypes in honey bees. Proc Natl Acad Sci USA 2012, 109:4968-4973. 

6. Li-Byarlay H, Li Y Stroud H, Feng S, Newman TC, Kaneda M, Hou KK Worley 
KC, Elsik CG, WIckline SA, et al: RNA interference knockdown of DNA 
methyltransferase 3 affects gene alternative splicing in the honey bee. 
Proc Nat Aca Sci USA 201 3, 1 1 0:1 2750-1 2755. 

7. Calderone NW: Insect pollinated crops, insect pollinators and US 
agriculture: trend analysis of aggregate data for the period 1992-2009. 
PLos One 2012 7:e37235. 

8. Gallai N, Salles JM, Settele J, Vaissiere BE: Economic valuation of the 
vulnerability of world agriculture confronted with pollinator decline. 
Ecol Econ 2009, 68:810-821. 

9. Alaux C, Dantec C, Parrinello H, Le Conte Y: Nutrigenomics in honey bees: 
digital gene expression analysis of pollen's nutritive effects on healthy 
and varroa-parasitized bees. BiAC Genomics 2011, 12:496. 



Elsik ef al. BMC Genomics 2014, 15:86 
http://www.biomedcentral.com/1471-2164/15/86 



10. Johnson RM, Mao W, Pollock HS, Niu G, Schuler MA, Berenbaum MR: 
Ecologically appropriate xenobiotics induce cytochrome P450s in Apis 
Mellifera. PLoS ONf 2012, 7:e31051. 

11. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, 
Dewar K, Doyle M, FitzHugh W, ef ah Initial sequencing and analysis of the 
human genome. Nature 2001, 409:860-921. 

12. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, 
Yandell M, Evans CA, Holt RA, ef ah The sequence of the human genome. 
Sdence 2001, 291:1304-1351. 

13. Church DM, Goodstadt U Hillier LW, Zody MC, Goldstein S, She X, Bult CJ, 
Agarwala R, Cherry JU DICuccio M, ef ah Lineage-specific biology revealed 
by a finished genome assembly of the mouse. PhoS Biol 2009, 7:el0001 12. 

14. International Human Genome Sequencing Consortium: Finishing the 
euchromatic sequence of the human genome. Nature 2004, 431:931-945. 

15. Celniker SE, Wheeler DA, Kronmiller B, Carlson JW, Halpern A, Patel S, Adams 
M, Champe M, Dugan SP, Frise E, ef ah Finishing a whole-genome 
shotgun: release 3 of the Drosophila melanogaster euchromatic genome 
sequence. Genofne Biol 2002, 3:RESEARCH0079. 

16. C elegans Sequencing Consortium: Genome sequence of the nematode C. 
elegans: a platform for investigating biology. Science 1998, 
282:2012-2018. 

17. Elsik CG, Mackey AJ, Reese JT, Milshina NV, Roos DS, Weinstock GM: 
Creating a honey bee consensus gene set. Genome Biol 2007, 8:R13. 

18. Bonasio R, Zhang G, Ye C, Mutti NS, Fang X, Qln N, Donahue G, Yang P, Li 
Q, Li C, ef ah Genomic comparison of the ants Camponotus floridanus 
and Harpegnathos saltator. Science 2010, 329:1068-1071. 

19. Nygaard S, Zhang G, Schiott M, U C, Wurm Y, Hu H, Zhou J, Ji U Qiu F, 
Rasmussen M, ef ah The genome of the leaf-cutting ant Acromyrmex 
echinatior suggests key adaptations to advanced social life and fungus 
farming. Genofne Res 2011, 21:1339-1348. 

20 Smith CD, Zimin A, Holt C, Abouheif F, Benton R, Cash E, Croset V, Currie 
CR, Elhaik E, Elsik CG, ef ah Draft genome of the globally widespread and 
invasive Argentine ant (Linepithema humile). Proc Natl Acad Sci USA 201 1, 
108:5673-5678. 

21. Smith CR, Smith CD, Robertson HM, Heimkampf M, Zimin A, Yandeii M, Holt 
C, Hu H, Abouheif E, Benton R, ef ah Draft genome of the red harvester 
ant Pogonomyrmex barbatus. Proc Natl Acad Sci L/SA 201 1 , 1 08:5667-5672. 

22. Suen G, Telling C, Li L, Holt C, Abouheif E, Bornberg-Bauer F, Bouffard P, 
Caldera EJ, Cash E, Cavanaugh A, ef ah The genome sequence of the 
leaf-cutter ant Atta cephalotes reveals insights into its obligate symbiotic 
lifestyle. PLoS Genet 201 1, 7:el 002007. 

23. Werren JH, Richards S, Desjardins CA, Niehuis 0, Gadau J, Coibourne JK, 
Beukeboom LW, Desplan C, Elsik CG, Grimmelikhuijzen CJ, ef ah Functional 
and evolutionary insights from the genomes of three parasitoid Nasonia 
species. Science 2010 327:343-348. 

24. Wurm Y, Wang J, Riba-Grognuz 0, Corona M, Nygaard S, Hunt BG, Ingram 
KK, Falquet L, Nipitwattanaphon M, Gotzek D, ef ah The genome of the fire 
ant Solenopsis invicta. Proc Natl Acad Sci U5/\ 2011 , 1 08:5679-5684. 

25. Elorea L, Souvorov A, Kalbfleisch TS, Salzberg SL Genome assembly has a 
major impact on gene content: a comparison of annotation in two Bos 
taurus assemblies. PLoS One 201 1, 6:e21400. 

26. Applied Biosystems: Whole genome sequencing of £ coli Bacteria. 
http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/ 
documents/generaidocuments/cms_057558.pdf 

27. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L McGuire A, He W, Chen YJ, 
Makhijani V, Roth GT, ef ah The complete genome of an individual by 
massively parallel DNA sequencing. Nofure 2008, 452:872-876. 

28. Holt C, Yandell M: MAKER2: an annotation pipeline and genome- 
database management tool for second-generation genome projects. 
SMC Bioinformatics 201 1, 12:491. 

29. Hunter S, Jones P, Mitchell A, Apweiler R, Attwood IK, Bateman A, Bernard 
T, BInns D, Bork P, Burge S, ef ah InterPro in 201 1: new developments in 
the family and domain prediction database. Nucleic Acids Res 2012, 
40:D306-D312. 

30. Parra G, Bradnam K, Ning Z, Keane T, Korf i: Assessing the gene 
space in draft genomes. Nucleic Acids Res 2009, 37:289-297. 

31. BeeBase Amel_4.5 JBrowse. http://hymenopteragenome.org:8080/ 
Amei_4.5/jbrowse. 

32. Lee F, Helt GA, Reese JT, Munoz-Torres MC Childers CP, Buels RM, Stein U 
Holmes IH, Elsik CG, Lewis SE Web Apollo: a web-based genomic 
annotation editing platform. Genome Biol 2013, 14:R93. 



Page 27 of 29 



33. BeeBase Amel_4.5 Annotation Tools, http://hymenopteragenome.org/ 
beebase/?q=registration_apolio_amel. 

34. Wright F: The 'effective number of codons' used in a gene. Gene 1990, 
87:23-29. 

35. Van Vaerenbergh M, Debyser G, Devreese B, de Graaf DC: Exploring 
the hidden honeybee (Apis mellifera) venom proteome by 
integrating a combinatorial peptide ligand library approach with 
FTMS. J Proteomics 2014. in press. 

36. Lee E Harris N, Gibson M, Chetty R, Lewis S: Apollo: a community resource 
for genome annotation editing. Bioinformatics 2009, 25:1836-1837. 

37. Ovchinnikov Y, Miroshnlkov Ai, Kudelin AB, Kostina MB, Boikov VA, 
Magazanik LG, GotgllT IM: Structure and presynaptic activity of 
tertiapin, a neurotoxin from honeybee venom. Bioorganictieskaya 
Khimiya 1980, 6:359-365. 

38. GmachI M, Kreil G: The precursors of the bee venom constituents apamin 
and MCD peptide are encoded by two genes in tandem which share the 
same 3'-exon. J Biol Chem 1 995, 270:1 2704-1 2708. 

39. Robertson HM, Wanner KW: The chemoreceptor superfamily in the honey bee, 
Apis mellifera: expansion of the odorant, but not gustatory, receptor family. 
Genome Res 2006 16:1395-1403. 

40. Witte CP, Le OH, Bureau T, Kumar A: Terminal-repeat retrotransposons in 
miniature (TRIM) are involved in restructuring plant genomes. Proc Natl 
Acad Sci USA 2001, 98:13778-13783. 

41 . Kalendar R, Viclent CM, Peleg 0, Anamthawat-Jonsson K, Bolshoy A, 
Schulman AH: Large retrotransposon derivatives: abundant, conserved 
but nonautonomous retroelements of barley and related genomes. 
Genefo 2004, 166:1437-1450. 

42. Jurka J, Kapitonov W, Paviicek A, Klonowski P, Kohany 0, Walichiewicz J: 
Repbase Update, a database of eukaryotic repetitive elements. 
Cytogenet Genome Res 2005, 1 1 0:462-467. 

43. Lampe DJ, Witherspoon DJ, Soto-Adames FN, Robertson HM: Recent 
horizontal transfer of mellifera subfamily mariner transposons into insect 
lineages representing four different orders shows that selection acts only 
during horizontal transfer. Mol Biol Evol 2003, 20:554-562. 

44. Feschotte C: Transposable elements and the evolution of regulatory 
networks. Nat Rev Genet 2008, 9:397-405. 

45. international Aphid Genomics Consortium: Genome sequence of the pea 
aphid Acyrthosiphon pisum. PLoS Biol 2010 8:el000313. 

46. Beye M, Gattermeier I, Hasselmann M, Gempe T, Schioett M, Baines 
JF, Schlipatlus D, Mougel E, Emore C, Rueppell 0, ef al: Exceptionally 
high levels of recombination across the honey bee genome. 
Genome Res 2006, 16:1339-1344. 

47. Kent CE, Minael S, Harpur BA, Zayed A: Recombination is associated 
with the evolution of genome structure and worker behavior in 
honey bees. Proc Natl Acad Sci USA 2012, 109:18012-18017. 

48. Simola DE, Wissler L, Donahue G, Waterhouse RM, Heimkampf M, 
Roux J, Nygaard S, Glastad K, Hagen DE, Vlljakalnen L, ef al: Social 
insect genomes exhibit dramatic evolution in gene composition 
and regulation while preserving regulatory features linked to 
sociality. Genome Res 2013, 23:1235-1247. 

49. Graveley BR, Brooks AN, Carlson JW, Duff MO, Landolin JM, Yang L, 
Artieri CG, van Baren MJ, Boley N, Booth BW, ef ah The 
developmental transcriptome of Drosophila melanogaster. 
Nature 2011, 471:473-479. 

50. Ladoukakis E, Pereira V, Magny EG, Fyre-Walker A, Couso JP: Hundreds of 
putatively functional small open reading frames in Drosophila. 
Genome Biol 2011, 12:R118. 

51 . Kastenmayer JP, Ni L, Chu A, Ktchen LE Au WC Yang H, Carter CD Wheeler D, 
Davis RW, Boeke JD, ef ah Functional genomics of genes with small open 
reading frames (sORFs) in S. cerevisiae. Genome Res 2006, 16:365-373. 

52. i5K Consortium: The iSK initiative: advancing arthropod genomics for 
knowledge, human health, Agriculture, and the Environment. J Hered 
2013, 104:595-600 

53. Genome 10 K Community of Scientists: Genome 10 K: a proposal to obtain 
whole-genome sequence for 1 0,000 vertebrate species. J Hered 2009, 
100:659-674 

54 Fckalbar WL Hutchins FD, Markov GJ, Allen AN, Corneveaux JJ, 

Lindbiad-Toh K, Di Raima F, Aifoldi J, Huentelman MJ, Kusumi K: Genome 
reannotation of the lizard Anolis carolinensis based on 14 adult and 
embryonic deep transcriptomes. BMC Genomics 2013, 14:49. 

55. ATLAS-Link https://www.hgsc.bcm.edu/software/Atlas-Link 



Elsik ef al. BMC Genomics 2014, 15:86 
http://www.biomedcentral.com/1471-2164/15/86 



Page 28 of 29 



56. Ament SA, Chan QW, Wheeler MM, Nixon SE, Johnson SP, Rodriguez-Zas SL, 
Foster U, Robinson GE: IVlechanisms of stable lipid loss in a social insect. 
J Exp Biol 2011,214:3808-3821. 

57. Chan MM, Choi SY, Chan QW, Li P, Guarna MM, Foster LJ: Prcteome profile 
and lentiviral transduction of cultured honey bee (Apis mellifera L.) cells. 
Insect l^ol Biol 2010, 19:653-658. 

58. Chan QW, Foster U: Changes in protein expression during honey bee 
larval development. Genome Biol 2008, 9:R156. 

59. Chan QW, Howes CG, Foster U: Quantitative comparison of caste 
differences in honeybee hemolymph. Mo/ Cell Proteomia 2006, 
5:2252-2262. 

60. Chan QW, Melathopoulos AP, Pernal SF, Foster LJ: The innate immune and 
systemic response in honey bees to a bacterial pathogen, Paenibacillus 
larvae. BMC Genomic 2009, 10:387. 

61. Massa AN, Wanjugi H, Deal KR, Q'Brien K, You FM, Maiti R, Chan AP, Gu YQ, 
Luo MC, Anderson OD, et at Gene space dynamics during the evolution 
of Aegilops tauschii, Brachypodium distachyon, Oryza sativa, and 
Sorghum bicolor genomes. Mol Biol Evol 2011, 28:2537-2547. 

62. Chan QW, Parker R, Sun Z, Deutsch EW, Foster LJ: A honey bee (Apis 
mellifera L.) PeptideAtlas crossing castes and tissues. BMC Genomics 201 1, 
12:290. 

63. NCBI Apis mellifera Annotation Build 5.1. ftp://ftp.ncbi.nlm.nih.gov/ 
genomes/Apis_mellifera. 

64. RepeatMasker Open-3.0. http://www.repeatmasker.org 

65. NCBI Eukaryotic genome annotation process, http://www.ncbi.nlm.nih. 
gov/genome/annotation_euk/process. 

66. Morgulis A, Gertz EM, Schaffer AA, Agarwala R: WindowMasker: window- 
based masker for sequenced genomes. Blolnformatics 2006, 22:134-141. 

67. Stanke M, Schoffmann Q, Morgenstern B, Waack S: Gene prediction in 
eukaryotes with a generalized hidden Markov model that uses hints 
from external sources. BMC Blolnformatics 2006, 7:62. 

68. Keller 0, Odronitz F, Stanke M, Kollmar M, Waack S: Scipio: using protein 
sequences to determine the precise exon/intron structures of genes and 
their orthologs in closely related species. BMC Blolnformatics 2008, 9:278. 

69. Kent WJ: BLAT-the BLAST-like alignment tool. Genome Res 2002, 
12:656-664 

70. Solovyev V: Finding genes by computer: probabilistic and discriminative 
approaches. In Current Topics in Computational Biology. Edited by Jiang T, Smith 
T, Xu Y Zhang M. Cambridge, MA: MIT Press; 2002:201-248. 

71. Solovyev V, Kosarev P, Seledsov I, Vorobyev D: Automatic annotation of 
eukaryotic genes, pseudogenes and promoters. Genome Biol 2006, 
7(Suppl 1):S10 11-12. 

72. ReadMap: Aligning next generation sequences to the genome sequence. 
http://linux5.softberry.com/cgi-bin/berry/programs/ReadsMap. 

73. Parra G, Blanco E, Guigo R: GenelD in Drosophila. Genome Res 2000, 
10:511-515. 

74. Parra G, Agarwal P, Abril JF, Wiehe T Fickett JW, Guigo R: Comparative 
gene prediction in human and mouse. Genome Res 2003, 13:108-1 1 7. 

75. van Baren MJ, Koebbe BC, Brent MR Using N-SCAN or TWINSCAN to 
predict gene structures in genomic DNA sequences. Curr Protoc 
Blolnformatics 2007, Chapter 4:Unit 4 8. 

76. Harris RS: Improved Palrwise Alignment of Genomic DNA. The Pennsylvania 
State University; 2007. 

77. WU-BLAST. http://blast.wustl.edu. 

78. Slater GS, Birney E: Automated generation of heuristics for biological 
sequence comparison. BMC Blolnformatics 2005, 6:31. 

79. UniProt Consortium: Update on activities at the Universal Protein 
Resource (UniProt) in 2013. Nucleic Acids Res 2013, 41:D43-D47. 

80. Marygold SJ, Leyland PC, Seal RL, Goodman JL, Thurmond J, Strelets VB, 
Wilson RJ: FlyBase: improvements to the bibliography. Nucleic Adds Res 
2012, 41:D751-D757. 

81. Pearson WR: Rapid and sensitive sequence comparison with FASTP and 
FASTA. Methods Enzymol 1990, 183:63-98. 

82. Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning 
DNA sequences. J Comput Biol 2000, 7:203-214. 

83. Wu TD, Watanabe CK: GMAP: a genomic mapping and alignment 
program for mRNA and EST sequences. Blolnformatics 2005, 21:1859-1875. 

84. Elhaik E, Graur D, Josic K Landan G: Identifying compositionally homogeneous 
and nonhomogeneous domains within the human genome using a novel 
segmentation algorithm. Nucleic Acids Res 2010, 38:el58. 



85. Elhaik E, Graur D: IsoPlotter-F: a tool for studying the compositional 
architecture of genomes. ISRN Blolnformatics 2013, 2013:725434. 

86. Olson SA: EMBOSS opens up sequence analysis. European molecular 
biology open software suite. Brief Blolnform 2002, 3:87-91. 

87. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z Miller W, Lipman DJ: 
Gapped BLAST and PSI-BLAST: a new generation of protein database 
search programs. Nucleic Adds Res 1997, 25:3389-3402. 

88. Munoz-Torres MC, Reese JT Childers CP, Bennett AK, Sundaram JP Childs 
KL, Anzola JM, Mllshina N, Elsik CG: Hymenoptera Genome Database: 
integrated community resources for insect species of the order 
Hymenoptera. Nucleic Adds Res 2011, 39:D658-D662. 

89. Waterhouse RM, Tegenfeldt F, Li J, Zdobnov EM, Kriventseva EV: OrthoDB: 
a hierarchical catalog of animal, fungal and bacterial orthologs. 
Nucleic Adds Res 201 3, 41 :D358-D365. 

90. Ashbumer M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, 
Dolinskj K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the 
unification of biology. The Gene ontology consortium. Nat Genet 2000, 
25:25-29. 

91. Castillo-Davis CI, HartI DL: GeneMerge-post-genomic analysis, data 
mining, and hypothesis testing. Blolnformatics 2003, 19:891-892. 

92. Whitfield CW, Band MR, Bonaldo MF, Kumar CG, Liu L, Pardinas JR, 
Robertson HM, Soares MB, Robinson GE: Annotated expressed sequence 
tags and cDNA microarrays for studies of brain and behavior in the 
honey bee. Genome Res 2002, 12:555-566. 

93. Zdobnov EM, Apweiler R: InterProScan-an integration platform for the 
signature-recognition methods in InterPro. Blolnformatics 2001, 
17:847-848. 

94 Punta M, Coggill PC, Eberhardt RY Mistry J, Tate J, Boursnell C, Pang N, 
Forslund K, Ceric G, Clements J, ef al: The Pfam protein families database. 
Nucleic Adds Res 2012, 40:D290-D301. 

95. Haft DH, Selengut JD, White 0: The TIGRFAMs database of protein 
families. Nucleic Adds Res 2003, 31:371-373. 

96. Letunic I, Doerks T, Bork P: SMART 7: recent updates to the protein 
domain annotation resource. Nucleic Adds Res 2012, 40:D302-D305. 

97. Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, Kahn D: The ProDom 
database of protein domain families: more emphasis on 3D. Nuddc Adds 
Res 2005, 33:D212-D215. 

98. Sigrist CJ, Cerutti L, de Castro E, Langendijk-Genevaux PS, Bulliard V, Bairoch 
A, Hulo N: PROSITE, a protein domain database for functional 
characterization and annotation. Nucleic Adds Res 2010, 38:D161-D166. 

99. Wu CH, Nikolskaya A, Huang H, Yeh LS, Natale DA, Vinayaka CR, Hu ZZ 
Mazumder R, Kumar S, Kourtesis P, et at: PIRSF: family classification system 
at the protein information resource. Nucleic Adds Res 2004, 32:D1 12-Dl 14. 

100. Lees J, Yeats C, Perkins J, Sillitoe I, Rentzsch R, Dessailly BH, Orengo C: 
GeneSD: a domain-based resource for comparative genomics, functional 
annotation and protein network analysis. Nucleic Adds Res 2012, 
40:D465-D471. 

101. Wilson D, Pethica R, Zhou Y Talbot C, Vogel C, Madera M, Chothia C, 
Gough J: SUPERFAMILY-sophisticated comparative genomics, data 
mining, visualization and phylogeny. Nucleic Adds Res 2009, 
37:D380-D386. 

102. Mi H, Dong Q, Muruganujan A, Gaudet P, Lewis S, Thomas PD: PANTHER 
version 7: improved phylogenetic trees, orthologs and collaboration 
with the Gene Ontology Consortium. Nucleic Acids Res 2010, 
38:D204-D210. 

103. Flutre T, Duprat E, Feuillet C, Quesneville H: Considering transposable 
element diversification in de novo annotation approaches. PfoS One 
2011, 6:el6526. 

104 Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment 
search tool. J Mol Biol 1 990, 21 5:403-41 0. 

105. Quesneville H, Nouaud D, Anxolabehere D: Detection of new transposable 
element families in Drosophila melanogaster and Anopheles gambiae 
genomes. J Mol Evol 2003, 57(Suppl 1):S50-S59 

106. Ellinghaus D, Kurtz S, Willhoeft U: LTRhan/est, an efficient and flexible 
software for de novo detection of LTR retrotransposons. 

BMC Blolnformatics 2008, 9:1 8. 

107. Price AL, Jones NC, Pevzner PA: De novo identification of repeat families 
in large genomes. Blolnformatics 2005, 21(Suppl I):i351-i358. 

108. Edgar RC, Myers EW: PILER: identification and classification of genomic 
repeats. Blolnformatics 2005, 21(Suppl l):il 52-11 58. 



Elsik ef al. BMC Genomics 2014, 15:86 
http://www.biomedcentral.com/1471-2164/15/86 



Page 29 of 29 



109. BLASTclust. http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastIab. 
html. 

110 HMMER 3.0. http://hmmer.org. 

111. Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, 
Leroy P, Morgante M, Panaud 0, ef al: A unified classification system for 
eukaryotic transposable elements. Nat Rev Genet 2007, 8:973-982. 

112. Benson G: Tandem repeats finder: a program to analyze DNA sequences. 
Nucleic Acids Res 1999, 27:573-580. 

113. Kolpakov R, Bana G, Kucherov G: mreps: Efficient and flexible detection of 
tandem repeats in DNA. Nucleic Adds Res 2003, 31:3672-3678. 

114. IVlarchler-Bauer A, Anderson JB, Derbyshire IVIK, DeWeese-Scott C, Gonzales 
NR, Gwadz IVl, Hao L, He S, HunA/itz Dl, Jackson JD, et al: CDD: a conserved 
domain database for interactive domain family analysis. Nucleic Acids Res 
2007, 35:D237-D240. 

115. Kohany 0, Gentles AJ, Hankus L, Jurka J: Annotation, submission and 
screening of repetitive elements in repbase: repbase submitter and 
censor. BMC Bioinformatlcs 2006, 7:474. 

116. Kapitonov W, Tempel S, Jurka J: Simple and fast classification of non-LTR 
retrotransposons based on phytogeny of their RT domain protein sequences. 
Gene 2009,448:207-213. 

1 1 7. Yuan YW, Wessler SR: The catalytic domain of all eukaryotic cut-and-paste 
transposase superfamilies. Proc Natl Acad Sci USA 201 ^,^ 08:7884-7889. 

118. Kapitonov W, Jurka J: A universal classification of eukaryotic transposable 
elements implemented in Repbase. Nat Rev Genet 2008, 9:41 1-412. 

119. Kodama Y, Shumway M, Leinonen R: The sequence read archive: explosive 
growth of sequencing data. Nucleic Acids Res 201 2, 40:D54-D56. 

120. Apis mellifera 454 Transcriptome BioProjects. httpy/www.ncbi.nlm.nih. 
gov/bioproject/?term=PRJ N A5 1 48 1 +OR+PRJ N A5 1 483. 

121. BeeBase. http://hymenopteragenome.org/beebase/ 



doi:10.1 186/1471-2164-15-86 

Cite this article as: Elsik et al: Finding the missing honey bee genes: 
lessons learned from a genome upgrade. SMC Genomics 2014 15:86. 



Submit your next manuscript to BiolVled Central 
and tal<e full advantage of: 

• Convenient online submission 

• Thorough peer review 

• No space constraints or color figure charges 

• Immediate publication on acceptance 

• Inclusion in PubMed, CAS, Scopus and Google Scholar 

• Research which is freely available for redistribution 



Submit your manuscript at 
www.biomedcentral.com/submit 



^3 BioMed Central 



